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Many  military  and  civilian  problems  can  be  viewed  as  pattern  recognition:  given  a  set  of  measured  inputs,  the 
task  is  to  predict  the  corresponding  output.  Typical  examples  range  from  image  recognition  and  classification,  to 
time  series  prediction  and  regression.  Most  modeling  assumes  that  the  inputs  can  be  measured  exactly,  without 
noise.  Building  a  model  then  means  to  construct  (or  “learn”)  a  mapping  from  these  inputs  to  the  expected  values 
of  the  outputs.  The  usually  tacit  assumption  of  noise-free  inputs  is  violated  in  most  real-world  problems  where 
only  a  noisy  version  of  the  “true”  input  is  observed. 

This  research  found  that  while  it  was  possible  for  time  series  problems  even  if  there  is  a  lot  of  noise  present,  to 
use  information  from  adjacent  patterns  in  time,  the  problem  could  be  solved  for  non-time  series  problems,  such 
as  the  phcLse  array  radar  data. 

The  effort  lead  to  several  papers.  Results  are  presented  on  discrete  hidden  states  (Hidden  Markov  models),  and 
continuous  hidden  state  (state  space  models).  A  paper  on  finding  the  true  inputs  using  Independent  Component 
Analysis  is  in  preparation.  A  paper  on  evaluation  methodology  using  the  bootstrap  also  employs  the  state  space 
approach. 
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1.  Objective 

Observations,  such  as  phased  array  radar  data,  contain 
noise,  usually  from  several  sources.  The  essence  of 
modeling,  and  subsequent  inference,  is  to  extract  the 
signal.  The  objective  of  this  grant  was  to  understand  the 
strengths  and  limitation  of  a  new  algorithm  called 
"clearning"  (the  combination  of  learning  the  model  and 
cleaning  the  data) ,  and  to  apply  it  to  phased  array  radar 
data . 


2.  Results 

The  proposal  was  written  with  a  three-year  time  horizon. 
Before  applying  the  algorithm  to  phased  array  radar  data, 
and  comparing  it  to  competing  algorithms,  the  first  goal 
was  to  understand  what  the  algorithm  can  do,  and  what  it 
cannot  do.  This  was  best  done  by  relating  it  to  the 
important  research  question:  to  what  degree  can  we  infer 
hidden  states  from  observed  data? 

The  key  result  is:  hidden  states  can  be  inferred 
successfully  for  time  series  data.  Time  series  data  have  , 
the  major  advantage  that  adjacent  patterns  are  indeed 
related  to  each  other.  This  is  not  the  case  in  standard, 
non-time-series  pattern  recognition  problems. 

The  first  progress  report  emphasized  the  important  of 
constraints  between  the  input  variables  to  exist  for 
clearning  to  work.  In  particular,  it  emphasized  that  the 
first  steps  of  the  project  thus  are  to  clarify  what  might 
be  done,  and  what  cannot  be  done  in  principle,  as  well  as 
to  relate  clearning  to  source  separation,  and,  in  the  case 
of  time  series,  to  state  space  modeling  and  Kalman 
filtering.  This  has  been  achieved:  The  following  describes 
the  research  that  my  collaborators  and  I  carried  out  in  the 
last  year  in  the  context  of  finding  ("hidden")  variables 
(continuous,  as  in  clearning,  or  discrete)  that  are  a  less 
noisy  characterization  of  the  systems  than  a  snapshot  of 
the  raw  observed  signal. 


Shi  and  Weigend  [1]  explore  discrete  hidden  states,  and 
show  their  usefulness  for  characterizing  and  predicting 
very  noisy  time  series.  This  is  an  extension  of  hidden 
Markov  models,  very  popular  in  the  speech  community,  but 
hardly  known  in  the  prediction  community.  The  key  idea  is: 
if  there  are  different  dynamics  in  different  regimes  of  the 
time  series,  and  these  regimes  last  for  a  while,  then 
rather  than  averaging  over  the  submodels,  a  more 


appropriate  model  is  obtained  by  estimating  both  the 
regime,  and  the  parameters  of  the  sub-models. 

The  MATLAB  code  we  wrote  for  these  experiments  is  available 
upon  request . 

The  power  of  hidden  Markov  models  crucially  depend  on  the 
time  series  nature  of  the  problem.  Gleaming,  in  contrast, 
as  well  as  the  "gated  experts”  architecture  (Weigend, 
Mangeas,  and  Srivastava  1996)  do  not  exploit  the  time 
series  structure  and  are  thus  both  more  broadly  applicable 
and  weaker. 


Timmer  and  Weigend  [2]  show  the  power  of  modeling  dynamic 
noise  and  observational  noise  separately.  I  had  mentioned 
previously  (Section  2.1  of  the  progress  report)  that  noisy 
inputs  can  lead  to  an  underestimation  of  the  parameters. 
This  paper  explores  this  point  further  and  shows  that  a 
case  where  the  decay  times  of  shocks  are  underestimated  by 
two  orders  of  magnitude  when  the  distinction  between 
observational  and  dynamic  noise  is  ignored.  While  state 
space  modeling  is  a  powerful  method,  it  crucially  depends 
on  the  time  series  nature  of  the  problem. 


Another  method,  suggested  in  the  progress  report,  is  blind 
source  separation,  related  to  independent  component 
analysis  (ICA) .  In  collaboration  with  Dr.  Andrew  Back  I 
started  to  explore  the  usefulness  of  independent  component 
analysis  (ICA,  also  called  blind  source  separation)  to  very 
noisy  data,  Japanese  stock  return,  in  comparison  to 
principal  component  analysis  (PGA) .  Preliminary  results 
indicate  that  estimated  independent  components  (ICs,  also 
called  "sources")  fall  into  two  distinct  categories:  (1)  a 
small  number  of  large  transient  shocks  (with  skewed 
distributions),  and  (2)  approximately  Gaussian  random 
noise. 


Finally,  the  revision  of  a  third  paper  by  LeBaron  and 
Weigend  [3]  focusing  on  focuses  on  performance  evaluation 
by  re-sampling,  profited  from  the  distinction  of  different 
noise  sources:  the  method  described  in  [2]  was  applied  to 
that  time  series  of  daily  NYSE  volume. 

In  summary,  while  these  papers  received  attention  at 
several  conferences  and  workshops,  and  have  been  accepted 
by  major  journals,  the  answer  to  the  first  stage  of  the 
clearning  question  has,  unfortunately,  been  largely 
negative.  I  currently  do  not  see  a  way  to  extend  the 
algorithm  to  non-time-series  data  as  I  had  hoped:  there 
simply  is  not  enough  information  for  the  degrees  of  freedom 
of  both  moving  the  data  and  the  model. 


3.  Publications 

[1]  Shanming  SHI  and  Andreas  S.  WEIGEND  "Taking  Time 
Seriously:  Hidden  Markov  Experts  Applied  to  Financial 
Engineering."  In:  Proceedings  of  the  lEEE/IAFE  1997 
Gonference  on  Gomputational  Intelligence  for  Financial 
Engineering  (GIFEr,  New  York,  March  1997),  pp.  244 — 252. 
Piscataway,  NJ:  IEEE  Service  Genter. 


http: //WWW. stern.nyu.edu/~aweigend/Research/Papers/HiddenMa 
rkov/ 

Abstract--Most  traditional  time  series  models  are  global 
models  based  on  local  time  information:  they  assume  that 
the  state  can  be  fully  and  locally  (in  time)  characterized 
with  a  finite  embedding  space.  Prediction  then  amounts  to 
simple  regression.  Unfortunately,  there  are  many  situations 
in  which  simple  regression  is  not  sufficient  to  model  the 
temporal  structure  in  a  time  series.  We  here  introduce  an 
architecture  that  we  call  Hidden  Markov  Experts.  It  is 
based  on  Hidden  Markov  Models  used  in  speech  recognition 
research.  By  introducing  the  concept  of  hidden  states. 
Hidden  Markov  experts  model  time  dependency  of  time  series 
explicitly  as  a  first-order  Markov  model  with  transitions 
between  these  hidden  states.  Within  each  state,  local 
models  are  applied  to  estimate  the  probability  density, 
which  can  be  linear  or  nonlinear  depending  on  the 
situation.  This  paper  first  discusses  the  statistical 
framework  and  the  learning  algorithm  of  Hidden  Markov 
experts,  then  applies  them  to  daily  S&P500  data  and  to  high 
frequency  currency  exchange  rate  data.  The  Hidden  Markov 
Experts  have  better  profit  than  the  linear  and  nonlinear 
global  models.  The  volatilities  of  the  time  series  can  be 
characterized  by  the  hidden  states. 


[2]  Jens  TIMMER  and  Andreas  S.  WEIGEND  "Exploiting  Local 
Relations  as  Soft  Constraints  to  Improve  Forecasting." 
Forthcoming  in:  International  Journal  of  Neural  Systems, 
Vol.  8  (1997) . 

http : / /WWW. stern . nyu . edu/~aweigend/Research/Papers /StateSpa 
ce 

Abstract — In  time  series  problems,  noise  can  be  divided 
into  two  categories:  dynamic  noise  which  drives  the 
process,  and  observational  noise  which  is  added  in  the 
measurement  process,  but  does  not  influence  future  values 
of  the  system.  In  this  framework,  empirical  volatilities 
(the  squared  relative  returns  of  prices)  exhibit  a 
significant  amoxint  of  observational  noise.  To  model  and 
predict  their  time  evolution  adequately,  we  estimate  state 
space  models  that  explicitly  include  observational  noise. 

We  obtain  relaxation  times  for  shocks  in  the  logarithm  of 
volatility  ranging  from  three  weeks  (for  foreign  exchange) 
to  three  to  five  months  (for  stock  indices) .  In  most  cases, 
a  two-dimensional  hidden  state  is  required  to  yield 
residuals  that  are  consistent  with  white  noise.  We  compare 
these  results  with  ordinary  autoregressive  models  (without 
a  hidden  state)  and  find  that  autoregressive  models 
underestimate  the  relaxation  times  by  about  two  orders  of 
magnitude  due  to  their  ignoring  the  distinction  between 
observational  and  dynamic  noise.  This  new  interpretation  of 
the  dynamics  of  volatility  in  terms  of  relaxators  in  a 
state  space  model  carries  over  to  stochastic  volatility 
models  and  to  GARCH  models,  and  is  useful  for  several 
problems  in  finance,  including  risk  management  and  the 
pricing  of  derivative  securities. 
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[3]  Blake  LeBARON  and  Andreas  S.  WEIGEND  "A  Bootstrap 
Evaluation  of  the  Effect  of  Data  Splitting  on  Financial 
Time  Series."  Forthcoming  in:  IEEE  Transactions  on  Neural 
Networks,  Vol  9  (1998) . 

http://www.stern.nyu.edu/~aweigend/Research/Papers/Bootstra 

P/ 

Abstract;  This  article  exposes  problems  of  the  commonly 
used  technique  of  splitting  the  available  data  into 
training,  validation,  and  test  sets  that  are  held  fixed, 
warns  about  drawing  too  strong  conclusions  from  such  static 
splits,  and  shows  potential  pitfalls  of  ignoring 
variability  across  splits.  Using  a  bootstrap  or  resampling 
method,  we  compare  the  uncertainty  in  the  solution  stemming 
from  the  data  splitting  with  neural  network  specific 
uncertainties  (parameter  initialization,  choice  of  number 
of  hidden  units,  etc.).  We  present  two  results  on  data  from 
the  New  York  Stock  Exchange.  First,  the  variation  due  to 
different  resamplings  is  significantly  larger  than  the 
variation  due  to  different  network  conditions.  This  result 
implies  that  it  is  important  to  not  over-interpret  a  model 
(or  an  ensemble  of  models)  estimated  on  one  specific  split 
of  the  data.  Second,  on  each  split,  the  neural  network 
solution  with  early  stopping  is  very  close  to  a  linear 
model;  no  significant  nonlinearities  are  extracted. 
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Abstract.  Most  traditional  time  series  models  are  global  models  based  on  local  time  information:  they  assume  that 
the  state  can  be  fully  and  locally  (in  time)  characterized  with  a  finite  embedding  space.  Prediction  then  amounts 
to  simple  regression.  Unfortunately,  there  axe  many  situations  in  which  simple  regression  is  not  sufficient  to  model 
the  temporal  structure  in  a  time  series.  We  here  introduce  an  architecture  that  we  call  Hidden  Markov  Experts. 
It  is  based  on  Hidden  Markov  Models  used  in  speech  recognition  research.  By  introducing  the  concept  of  hidden 
states,  Hidden  Markov  experts  model  time  dependency  of  time  series  explicitly  as  a  first-order  Markov  model  with 
transitions  between  these  hidden  states.  Within  each  state,  local  models  are  applied  to  estimate  the  probability 
density,  which  can  be  linear  or  nonlinear  depending  on  the  situation.  This  paper  first  discusses  the  statistical 
framework  and  the  learning  algorithm  of  Hidden  Markov  experts,  then  applies  them  to  daily  S&PSOO  data  and  to 
high  frequency  currency  exchange  rate  data.  The  Hidden  Markov  Experts  have  better  profit  than  the  linear  and 
nonlinear  global  models.  The  volatilities  of  the  time  series  can  be  characterized  by  the  hidden  states. 

Keywords.  Regime  Switching,  Hidden  States,  Probability  Density  Prediction,  Non-constant  Transition  Probabili¬ 
ties,  EM  Algorithm,  Risk  Estimation,  Decision  Technology. 

Data  sets  used.  High-frequency  DEM/USD  exchange  rates.  Daily  S&P  500. 


1  Introduction 

Basic  linear  time  series  models  are  global  models  based  on  local  time  information  and  are  typically  based 
on  two  assumptions:  (1)  stationarity  or  weak  stationarity  of  the  time  series,  (2)  the  time  series  can  be  fully 
and  locally  (in  time)  characterized  within  a  finite  embedding  space.  However,  many  financial  time  series  are 
certainly  not  stationary.  In  particular,  they  tend  to  have  either  time-varying  means  or  variances  or  both, 
and  for  high  frequency  data,  have  varying  dynamics  during  the  day.  Some  of  these  problems  are  addressed 
for  example  through  the  family  of  autoregressive  conditional  heteroskedastic  (ARCH)  processes,  assuming 
that  the  variance  of  the  time  series  conditionally  depends  on  past  variances. 

An  important  class  of  nonstationarity  is  piece-wise  stationarity  where  the  time  series  switches  between 
different  regions.  Within  the  regions,  the  time  series  satisfy  the  requirement  of  stationarity,  but  between 
them,  they  might  have  different  noise  levels  or  different  dynamics.  Examples  of  such  models  are  threshold 
autoregressive  models  and  stochastic  volatility  models.  Although  a  single  global  model  can  theoretically 
express  any  relationship  including  regime  switching,  it  is  often  very  hard  to  estimate  such  a  global  model 
from  the  data.  Many  architectures  have  been  proposed  to  solve  the  problem  of  regime  switching  (e.g., 
[Jacobs  et  al.,  1991,  Weigend  et  al.,  1995]),  which  decompose  the  global  model  into  modular  local  models 
for  the  regions.  However,  the  key  point  is  how  to  split  the  data  space.  In  these  models,  however,  the  regions 
are  assumed  to  be  independent  of  each  other,  i.e.,  if  we  shuffle  the  patterns  of  the  data  set,  there  will  be  no 
difference  in  the  final  model. ^ 

In  this  paper,  we  use  Hidden  Markov  Experts  to  explicitly  model  the  time  dependency  between  adjacent 

*The  author  is  currently  with  J.P.Morgan  &  Co.  Inc.,  60  Wall  St,  New  York,  NY,  10260 
shi-shanmingQ  jpmorgan .  com 

^In  this  paper,  the  word  pattern  denotes  an  input-output  pair. 


patterns  of  the  time  series.  HMMs  have  been  widely  used  in  the  field  of  speech  recognition  where  context 
is  important  [Rabiner,  1989].  It  can  also  be  used  in  modeling  the  time  dependency  of  regime  switching. 
Related  work  in  this  field  is  Hamilton’s  regime  switching  model  [Hamilton,  1990].  However,  in  Hamilton’s 
work  the  regions  or  the  states  can  be  directly  estimated  from  the  current  observation,  while  in  Hidden  Markov 
Experts,  the  states  are  hidden  from  the  observation  and  depend  on  the  whole  history  of  observations.  There 
are  several  variations  of  Hamilton’s  work  on  regime  switching,  see  Chapter  22  in  [Hamilton,  1994].  They 
however  all  use  linear  predictors,  whereas  we  allow  for  nonlinear  predictors. 

This  paper  is  organized  as  follows:  Section  2  explains  the  basic  idea  and  formalism  of  Hidden  Markov 
Experts.  Section  3  reports  the  results  of  the  experiments  we  carried  out  on  computer-generated  data  (where 
we  know  the  true  segmentation),  as  well  as  on  two  real-world  problems  (high-frequency  exchange  rates 
and  S&P  500).  Section  4  describes  the  extension  of  non-constant  transition  probabilities,  and  Section  5 
summarizes  the  results  obtcdned. 


2  Hidden  Markov  Experts 

Any  maximum  likelihood  approach  needs  several  assumptions.  First,  a  noise  model  has  to  be  chosen;  it 
describes  how  likely  an  observed  data  point  is,  given  the  model’s  prediction.  The  typical  choice  of  minimizing 
the  sum  of  squared  errors  corresponds  to  assuming  Gaussian  distribution  for  the  noise  model.  Second,  a 
choice  about  the  architecture,  model  class  or  functional  form  between  inputs  and  outputs  has  to  be  made. 
Typicsd  examples  are  linear  models  for  regression,  logit  for  classification,  or  general  nonlinear  functions  such 
as  implemented  by  neural  networks.  Their  output  tjq)ically  are  expected  values  (possibly  with  variances)  as 
predictions.  In  standard  regression  or  classification  cases,  any  dependencies  between  patterns  are  ignored. 
The  third  assumption  now  addresses  precisely  these  dependencies  between  patterns:  we  here  use  a  Hidden 
Markov  model  to  model  the  relation  between  adjacent  patterns. 


2.1  Hidden  Markov  Models 

Basic  Idea:  The  observed  sequence  of  observations  is  determined  by  the  underlying  unobservable  stochastic 
process,  the  state  sequence  of  the  HMM,  with  an  emission  probability.  A  Hidden  Markov  model  is  called 
hidden  because  these  states  can  not  be  directly  estimated  from  the  observed  data.  We  also  assume  that  the 
hidden  process  is  a  Markov  process:  the  probability  of  the  next  state  depends  only  on  the  current  state  and 
the  transition  probability  between  the  two  sates.  Both  the  states  and  the  observed  process  can  be  either 
discrete  or  continuous.  For  time  series  modeling,  we  use  discrete  states  (corresponding  to  the  regimes)  and 
continuous  observations  (corresponding  to  the  time  series). 

Notation:  (1)  Observations  (time  series  data):  y  =  ...,2/^},  (2)  States:  S  =  {si,S2,  ...,sm}i 

(3)  Transition  probabilities:  A  =  {aij,i,j  €  M,ay  =  Prob{next  state  =  jjcurrent  state  =  f)},  (4)  Emission 
probabilities;  B  =  {b^j  Q:  M,t  €  T,b^j  =  Pro6(current  observation  =  j/‘|current  state  =  i)},  (5)  Initial 
probabilities  of  each  state:  H  =  6  M}.  For  convenience,  the  notation  A  =  {A,  B,  11}  is  introduced  to 

denote  the  entire  set  of  parameters. 

The  Maximum  Likelihood  Function:  The  central  problem  of  HMMs  is  to  find  the  parameters  A  that 
most  likely  fit  the  observed  data  y.  Under  our  assumption  of  the  time  dependency  between  patterns, 
the  probability  P(j/*|A)  is  not  independent  of  time  t,  for  pattern  at  time  t.  However  the  joint  probability 
P(j/*,  s‘|A)  is  independent  of  time  t.  The  likelihood  P(3^|A)  is  then  given  as 

V  Q  V  Q  V  Q 

where  Q  is  the  state  sequence  corresponding  to  each  pattern,  e.g.  Qa  =  G  S}  and 

P{y^\s^j,X)  =  Py  Therefore,  to  get  the  probability  P(y|A),  two  probabilities  need  to  be  estimated:  One  is 
the  probability  of  current  state,  the  other  is  the  emission  probability  given  the  current  state.  Since  there 
are  combinations  of  different  Q’s,  it  is  very  difficult  to  take  the  derivative  of  equation  (1)  with  respect 
to  A.  An  algorithm  to  do  this  is  called  the  forward-backward  procedure  can  be  used  to  efficiently  calculate 
the  P(3^1A)  [Baum,  1972,  Rabiner,  1989]. 


2.2  Experts:  models  for  emission  probabilities 

Now  we  can  specify  the  architecture  for  emission  probabilities.  If  we  assume  the  input  of  the  emission  model 
is  X  =  {xS  t  e  then  the  emission  probability  bj  =  P{y^\x\s^j,X).  This  is  the  likelihood  of  observing 

data  given  the  current  state  and  the  current  input.  We  call  each  of  the  specified  emission  models  an  expert 
and  each  individual  expert  corresponds  to  one  state. 

The  experts  can  take  on  different  architectures.  For  instance,  we  can  use  a  linear  model  or  a  nonlinear 
model,  such  as  a  neural  network.  Furthermore,  different  experts  can  have  different  sets  of  inputs.  This 
turns  out  to  be  an  important  advantage  that  alleviates  the  effects  of  the  curse  of  dimensionality.  In  this 
paper,  we  are  going  to  use  the  neural  and  the  AR  model  as  the  experts.  Instead  of  emission  probability  B, 
we  have  a  new  set  of  parameters  05  for  the  emission  model.  The  emission  probability  B  can  be  computed 
from  05.  One  can  then  compute  A  =  {A^  05,  11}. 


2.3  Learning  algorithm 

Baum  and  his  colleagues  [Baum,  1972]  proposed  an  elegant  algorithm  called  the  forward-backward  procedure 
to  calculate  P(3^|A).  They  also  introduced  an  EM  (Expectation  Maximization)  algorithm  to  maximize  this 
probability.  We  here  generalize  these  algorithms  to  Hidden  Markov  Experts. 

•  Forward-backward  procedure:  Define  a*  =  P{y^ ,y^,  ...,y^,s\\X)^  where  1  <  t  <  T.  Then 

we  obtain  for  the  probability  P(3^|A)  =  calculated  through  the  recursive 

procedure:  a-  =  TTib]  anda*'^^  =  [Yl^i  ^  This  is  called  the  forward  procedure.  Similarly, 

we  can  define  the  backward  variable  /3|  =  A).  With  the  recursive  induction 

Pf  =  1  and  Pi  =  where  t  =  T  -  l,r  -  2, 2, 1,  we  can  get  all  the  p  for  each  t 

The  reason  we  need  this  backward  procedure  is  to  use  the  whole  observed  sequence  to  estimate  the 
probability  P(s^|A).  With  a  and  /?,  we  can  determine  the  7-  defined  as  P(5^|3^,  A) 

,  ^  p{yAW  ^  ^  a\p\ 

P(3^|A)  P(3^|A) 

The  probability  7I  can  be  used  as  the  estimation  of  P(5f|A),  since  it  is  the  best  we  can  do  given 
the  whole  observation  sequence.  Similarly  an  auxiliary  probability,  =  P(sf,  A),  can  also 

be  computed  with  a  and  p  as 

•  EM  algorithm:  In  the  expectation  step,  the  probability  a  and  and  in  turn,  the  posterior  7 
and  ^  for  each  t,  are  calculated  based  on  the  current  estimation  of  A  according  to  (2)  and  (3). 

In  the  maximization  step,  we  update  the  A  =  {A^  05, 11}  according  to  tt^  =  7/  and 

_  expected  number  of  transitions  from  state  i  to  j  Ylt 

expected  number  of  transitions  from  state  i  (to  anywhere) 

For  each  emission  model,  maximizing  equation  (1)  is  the  same  as  maximizing  the  following: 

P{y,  5*  =  h  V  t\x,  A)  =  P{y'\^\  s),  eihj  (4) 

where  the  0^  represents  the  parameters  of  the  emission  model  of  state  j.  Equation  (4)  or  its 
negative  logarithm  can  be  viewed  as  a  cost  function  for  the  emission  model.  The  way  to  compute  the 
parameter  0;^  depends  on  the  format  of  the  experts  and  the  assumption  of  the  noise.  For  example, 
in  our  experiment  we  assume  the  error  to  be  Gaussian  distributed  and  use  neural  networks  as 
experts.  Equation  (4)  is  then  similar  to  weighted  sum-squared  error  and  the  parameter  0^  includes 
the  weights  and  bias  of  the  network,  as  well  as  the  variance  of  the  data  in  state  j. 
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2.4  Making  prediction 


For  financial  engineering,  the  key  question  is  how  to  make  predictions  with  Hidden  Markov  Experts.  In 
prediction,  we  cannot  use  Equation  (2)  to  estimate  the  state,  because  it  includes  future  information.  How¬ 
ever,  given  the  sequence  of  observations  up  to  now,  we  can  estimate  the  probability  of  state  in  terms  of  the 
transition  probabilities  aij  and  a  as 


f(»,-  to  . ■>!««) 

The  expected  value  of  the  prediction  then  becomes 


(5) 


3  Experiments 

We  present  one  computer  simulated  data  set  and  two  real  world  data  sets,  which  are  the  high  frequency 
Olsen  foreign  exchange  data  and  the  S&P500  daily  data.  For  the  real  world  data,  we  compute  the  profit 
based  on  the  sign  of  the  prediction,  i.e.,  buy  if  the  sign  of  the  prediction  of  the  next  return  is  positive  and 
sell  if  the  sign  of  the  prediction  of  the  next  return  is  negative.  Each  data  set  is  split  into  a  training  set  and 
a  test  set.  All  the  results  given  are  obtained  on  the  test  set. 


3.1  Computer  simulated  data 

To  convince  ourselves  of  the  applicability  of  the  idea,  we  generate  a  time  series  that  switches  between  a 
trending  and  a  mean  reverting  process  with  i.i.d.  Gaussian  innovations.  The  diagonal  transition  probabilities 
of  the  transition  matrix  are  an  —  0.98,  and  022  =  0.97.  The  trending  process  is  =  0.2r^  +  0.8  A7(0, 1), 
and  the  mean  reverting  process  is  =  -0.15r*  +  0.5  A/'(0, 1).  These  two  are  high  noise  processes  where 
the  signal-to-noise  ratio  is  0.04  and  0.0225  respectively. 

Two  AR  experts  are  used  in  this  experiment.  In  contrast  to  real  world  data,  we  know  the  true  segmentation 
of  the  data  set.  The  experiment  shows  that  the  hidden  Markov  experts  recovered  the  regimes  and  the 
parameters  including  the  variances  of  the  two  processes.  Figure  1  shows  the  segmentation  of  one  expert 
on  out-of-sample  data  compared  to  the  true  segmentation.  Table  1  gives  the  statistics  of  the  parameter 
estimations:  the  transition  probabilities  A,  the  AR  coefficients  «,  and  the  standard  deviations  of  Gaussian 
noise  a  over  20  different  runs. 


Figure  1:  Regimes  found  by  one  expert  compared  to  the  true  segmentation  on  out-of-sample  simulated  data. 
The  solid  line  shows  the  true  value  used  when  the  data  was  generated,  the  dotted  line  the  causal  prediction 
of  the  model. 


Gil 

022 

«1 

«2 

<^2 

true  value 

0.980 

0.970 

0.200 

-0.150 

0.800 

0.500 

mean  of  fitted 

0.981 

0.971 

0.193 

-0.140 

0.804 

0.498 

std  of  fitted 

0.002 

0.004 

0.015 

0.026 

0.006 

0.008 

Table  1:  Summary  of  the  experiments  on  the  computer  simulations.  For  the  transition  probabilities  on  the 
main  diagonal  an,  the  AR  coefficients  ku  and  the  noise  levels  cr,  we  give  the  true  values  as  well  as  the  mean 
and  standard  deviation  of  their  estimates  through  20  runs  of  the  model. 


3.2  High  frequency  foreign  exchange  data 

The  first  real  world  data  set  is  part  of  Olsen’s  DEM/USD  foreign  exchange  rate  data  based  on  a  variable 
time  scale,  called  ■d  time,  instead  of  fixed  intervals  of  physical  time  [Dacorogna  et  al.,  1996].  We  model  these 
data  with  three  experts  all  using  five  lagged  values  of  the  time  series  as  the  input  to  predict  the  next  vmue. 
We  compare  the  results  with  a  global  linear  model  (the  AR  model)  and  a  global  nonlinear  model  (a  _fe^- 
forward  neural  network).  The  neural  network  has  one  linear  output  and  10  tanh  hidden  units.  The  training 
set  contains  1000  points  from  19/05/95  17:58  to  09/06/95  14:32.  The  test  set  contains  1000  pomts  from 
09/06/95  14:41  to  29/06/95  23:54.  The  Hidden  Markov  Experts  are  trained  for  one-step  ahead  predictions 
(i.e.,  half  an  hour  in  i?  time). 

Figure  2  shows  the  results  on  the  test  set.  The  top  panel  gives  the  data,  the  central  panel  the  absolute  values 
of  the  price  returns,  and  the  bottom  panel  shows  the  segmentation  of  the  three  experts  for  the  Olsen  test 
data  Comparing  the  lower  two  panels  note  that  the  first  expert  tends  to  take  the  regimes  with  relatively  low 
volatility,  the  second  expert  tends  to  take  the  regimes  with  relatively  high  volatility,  and  the  third  expert 
takes  care  of  the  outliers. 

When  we  estimate  the  parameters  of  Hidden  Markov  Experts,  we  also  estimate  the  variance  of  each  expert. 
They  are  plotted  as  a  function  of  training  time  in  Figure  4.  Note  that  the  variance  of  the  first  expert  is  only 
about  a  quarter  of  the  variance  of  the  second  expert.  This  can  be  associated  with  low  and  high-volatility 
regimes.  The  third  expert  rarely  gets  activated  in  response  to  large  returns. 

Figure  3  shows  the  profit  and  loss  curves  of  the  Hidden  Markov  Experts  after  50  training  iterations,  in 
comparison  to  a  simple  feed-forward  neural  network,  linear  regression,  and  to  a  simple  “short-and-hold 
method.  The  Hidden  Markov  Experts  have  the  highest  profit  over  the  period  of  the  test  set. 
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Figure  2:  Panel  1  of  this  figure  is  the  test  set  of  the  Olsen  data.  Panel  2  shows  the  absolute  vdues  of 
the  return  of  the  price.  Panel  3  shows  the  regimes  found  by  each  experts  for  the  Olsen  data  We  plot 
the  responses  of  all  three  experts  in  this  panel  with  different  offsets.  We  can  see  that  the  first  expert  is 
responsible  for  the  low  volatility  regions,  while  the  second  expert  is  responsible  for  high  volatility  repons, 
and  the  third  expert  acts  as  a  collector  for  the  outliers.  This  result  is  consistent  with  the  variances  of  each 
expert,  given  in  Figure  4. 


An  important  feature  of  this  architecture  is  that  it  gives  the  entire  probability  density  of  the  return-not  just 


248 


Profit  and  Loss 


Figure  3:  Profit  and  Loss  for  Olsen  data.  This  figure  shows  that  the  Hidden  Markov  Experts  give  the 
highest  profit  in  comparison  to  the  pure  linear  regression  and  the  feed-forward  neural  network.  Transaction 
cost  are  not  taken  into  account. 


a  prediction  of  its  expected  value.  An  example  is  given  in  Figure  5,  where  the  solid  line  is  the  mixture  and 
the  dashed  lines  are  the  individual  Gaussians.  The  full  density  can  subsequently  be  used  in  the  computation 
of  risk  [Weigend  et  al.,  1997]. 


3.3  Daily  S&P500 

The  second  real  world  experiment  compares  the  results  of  the  linear  Hidden  Markov  Experts  with  nonlinear 
Hidden  Markov  Experts  (using  neural  networks  as  experts).  Two  experts  have  been  used  for  both  the  linear 
and  nonlinear  Hidden  Markov  Experts.  Each  neural  network  expert  has  one  linear  output  unit  and  10  tanh 
hidden  units.  The  training  set  spans  from  01/12/73  to  12/31/86  and  the  test  set  spans  from  01/02/87  to 
12/29/94.  Figure  6  shows  the  S&P  500  data,  the  daily  returns,  and  the  segmentation  found  by  nonlinear 
Hidden  Markov  Experts.  The  plotted  regime  corresponds  to  the  low  volatility  regions,  its  complement  to 
high  volatility  regions.  The  Figure  7  shows  the  profit  and  loss  curves  of  the  different  methods.  The  nonlinear 
Hidden  Markov  Experts  have  better  profit  than  the  linear  Hidden  Markov  Experts  and  the  AR  model  in 
the  test  period. 


4  Time-varying  Transition  Probabilities 

We  extend  the  work  described  so  far  by  relaxing  the  assumption  of  the  being  constant:  we  assume  that 
the  transition  probabilities  vary  depending  upon  some  external  inputs.  This  is  a  reasonable  assumption  for 
modeling  the  complex  financial  market  and  the  influence  of  different  kind  of  economic  indicators. 

Our  extension  can  be  compared  to  Bengio  and  Frasconi’s  “Input  Output  Hidden  Markov  Model,”  using 
a  recurrent  mixture  of  experts  to  estimate  the  transition  probabilities  [Bengio  and  Frasconi,  1996].  The 
recurrent  architecture  and  many  parameters  lead  to  problems  in  convergence  and  overfitting.  The  problem 
is  that  we  have  no  target  for  the  transition  probabilities.  Here  we  provide  a  new  and  simpler  way  to 
implement  the  idea  of  nonstationarity  based  on  Hidden  Markov  Experts. 

Instead  of  directly  estimating  the  transition  probability  P(s^|sp,  we  can  model  the  time  dependency  of 
the  joint  probability  According  to  Equation  (3),  we  can  compute  the  posterior  probability  — 

A)  with  a  and  /?  for  each  time  t.  Then  with  Bayes  rule,  we  obtain  the  transition  probability 
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Figure  4:  Learning  curve  of  the  variance  for  each  expert  for  the  Olsen  data.  We  can  see  that  the  first  expert 
is  responsible  for  the  low  volatility  regions,  while  the  second  expert  is  responsible  for  high  volatility  regions 
(similar  to  the  variance  of  the  time  series).  The  third  expert  acts  as  a  collector  for  the  outliers.  (The  data 
is  normalized  to  unit  variance.) 


at  time  t 


A- 


(6) 


In  order  to  estimate  the  transition  probability  with  extra  inputs,  the  architecture  can  be  either  a  linear 
model  or  a  nonlinear  model.  The  target  of  this  local  model  is  at  time  t.  Since  is  a  probability, 
we  can  use  softmax  to  meet  the  constraint.  We  have  applied  this  model  to  computer-generated  data  with 
known  time- varying  transition  probabilities.  The  architecture  and  algorithm  correctly  estimates  the  time 
dependency  of  the  transition  probabilities. 


5  Conclusions 

This  paper  introduced  the  theory  and  architecture  of  Hidden  Markov  Experts.  We  presented  several  exper¬ 
iments  on  financial  time  series.  The  key  results  are: 

1.  We  can  find  clean  segmentation  into  a  small  number  of  experts.  There  is  no  way  of  determining 
an  “optimal”  number  of  experts  from  first  principles  and  the  data.  For  real  world  problems,  this  is 
one  of  the  degrees  of  freedom  in  modeling. 

2.  We  show  that  the  segmentation  can  be  interpreted  in  terms  of  volatility.  We  carried  out  tests  ran¬ 
domly  flipping  the  sign  of  the  returns,  yielding  similar  segmentation.  However,  for  these  randomized 
returns  the  profit  disappears  as  expected. 

3.  We  compare  the  Hidden  Markov  Experts  to  linear  models  and  to  a  simple  buy-and-hold  strategy. 
On  all  data  sets  we  tested,  both  profit  and  Sharpe  Ratio  are  better  for  the  Hidden  Markov  Experts 
than  for  the  benchmarks. 

4.  We  also  compare  Hidden  Markov  Experts  with  nonlinear  emissions  models  to  those  with  linear  emis¬ 
sion  models.  When  properly  controlled  for  overfitting,  the  nonlinear  emission  models  outperform 
the  linear  ones. 

5.  We  extend  the  standard  framework  of  constant  transition  probabilities  to  conditional  transition 
probabilities. 

6.  An  important  application  for  this  architecture  is  risk  management.  The  algorithm  gives  the  proba¬ 
bility  density  of  the  return-not  just  a  prediction  of  its  expected  value!  The  density  is  expressed  as 
a  mixture  of  Gaussians  and  can  be  used  in  the  computation  of  various  risk  measures. 
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Mixture  Density 


Figure  5:  Predicted  probability  density  function  of  the  returns  for  a  specific  half-hour  prediction  of  the  test 
set  of  the  Olsen  data.  The  solid  curve  in  this  figure  shows  the  mijcture  of  the  three  Gaussian  densities. 
The  individual  densities  of  each  state  are  shown  as  dashed  curves.  The  circle  corresponds  to  the  mean 
prediction,  and  the  solid  dot  is  the  target.  The  individual  mean  of  each  expert  is  shown  by  the  x’s.  (We 
did  not  normalize  the  figure  to  integrate  to  unity;  the  curves  are  just  proportional  to  the  density.) 
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Figure  6:  The  first  panel  is  the  S&P500  data,  the  second  panel  shows  the  returns,  the  third  panel  is  the 
state  of  the  low  volatility  regions. 
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Figure  7:  This  figure  shows  the  profit  and  loss  of  the  S&PSOO  data.  The  nonlinear  Hidden  Markov  Experts 
(nonlinear  HMEs)  have  better  profit  than  the  linear  Hidden  Markov  Experts  (linear  HMEs)  and  the  AR 
model. 
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Abstract.  In  time  series  problems,  noise  can  be  divided  into  two  categories:  d3mamic  noise 
which  drives  the  process,  and  observational  noise  which  is  added  in  the  measurement  process, 
but  does  not  influence  future  values  of  the  system.  In  this  framework,  empirical  volatilities  (the 
squared  relative  returns  of  prices)  exhibit  a  signiflcant  amount  of  observational  noise.  To  model 
and  predict  their  time  evolution  adequately,  we  estimate  state  space  models  that  explicitly  include 
observational  noise.  We  obtain  relaxation  times  for  shocks  in  the  logarithm  of  volatility  ranging 
from  three  weeks  (for  foreign  exchange)  to  three  to  five  months  (for  stock  indices).  In  most  cases,  a 
two-dimensional  hidden  state  is  required  to  yield  residuals  that  axe  consistent  with  white  noise.  We 
compare  these  results  with  ordinary  autoregressive  models  (without  a  hidden  state)  and  find  that 
autoregressive  models  underestimate  the  relaxation  times  by  about  two  orders  of  ma^itude  due  to 
their  ignoring  the  distinction  between  observational  and  dynamic  noise.  This  new  interpretation 
of  the  dynamics  of  volatility  in  terms  of  relaxators  in  a  state  space  model  carries  over  to  stochastic 
volatility  models  and  to  GARCH  models,  and  is  useful  for  several  problems  in  finance,  including 
risk  management  and  the  pricing  of  derivative  securities. 

Data  sets  used.  Olsen  &  Associates  high  frequency  DEM/USD  foreign  exchange  rates  (8  years). 
Nikkei  225  index  (40  years).  Dow  Jones  Industrial  Average  (25  years). 
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1  Introduction 


Modeling  and  predicting  the  volatility  of  financial  time  series  has  become  one  of  the  central  areas 
in  finance  and  trading;  examples  range  from  pricing  derivative  securities  to  computing  the  risk  of  a 
portfolio.  Volatility  is  usually  predicted  using  generalized  autoregressive  conditional  heteroskedas- 
tic  (GARCH)  models;  Bollerslev,  Engle  and  Nelson  (1995)  guide  through  the  GARCH  literature, 
and  Engle  (1995)  collects  some  of  the  key  papers. 

Here  we  present  an  alternative  to  GARCH  that  models  the  underlying  dynamics  using  a  state 
space  model.  This  allows  us  to  describe  the  hidden  process  in  terms  of  variables  natural  for 
a  dynamic  system,  such  as  decay  times  for  shocks,  its  spectrum,  and  the  dimensionality  of  the 
underlying  process.  Stochastic  volatility  models  (see  Shephard  (1996)  for  a  review)  are  a  variant  of 
the  general  state  space  approach  presented  here.  They  differ  in  that  the  mapping  from  the  hidden 
variable  to  the  observed  variable  is  nonlinear.  The  interpretation  developed  in  this  article  can  also 
be  helpful  for  understanding  and  characterizing  stochastic  volatility  models. 

This  article  is  organized  as  follows:  Section  2  discusses  observational  noise  and  dynamic  noise, 
and  reviews  intuitions  and  interpretations  for  linear  systems,  important  for  understanding  the 
results  in  physical  terms,  such  as  decay  times  of  volatility  shocks.  Section  3  defines  and  explains 
the  formalism  of  state  space  models.  Variations  and  interpretations  that  are  typical  in  finance  and 
in  econometrics  are  given  in  Section  4.  Section  5  describes  the  three  data  sets  used  for  the  empirical 
studies.  The  results  are  presented  in  Section  6,  and  the  effect  of  ignoring  existing  observational 
noise  on  the  model  is  discussed  in  Section  7.  Section  8  summarizes  the  findings  and  discusses  some 
of  the  applications  of  this  approach  for  noisy  time  series  in  finance. 

2  Some  background  concepts 

2.1  Observational  noise  and  dynamic  noise 

In  time  series  modeling,  one  crucial  question  is  whether  or  not  observational  noise  is  present  in  the 
data.  Observational  noise  of  a  high  level  can  pose  a  severe  problem  if  it  is  not  treated  properly, 
leading  to  models  that  underestimate  the  functional  relation  between  past  and  future  values.  A 
typical  example  of  such  observational  noise  is  when  an  astronomer  observes  a  star:  fluctuations 
in  the  atmosphere,  or  a  subway  train  passing  by  and  shaking  a  telescope  that  points  to  the  star, 
will  not  influence  the  dynamics  of  the  star.  In  contrast,  a  noise  component  that  does  influence  the 
dynamics  of  a  system  is  called  dynamic  noise.  For  example,  in  an  autoregressive  process,  the  noise 
truly  moves  the  state  (sometimes  also  expressed  as  “the  noise  drives  the  system”),  and  subsequent 
values  are  derived  from  that  moved  state. 

This  article  focuses  on  discrete  time  dynamics,  typically  modeled  by  difference  equations  or 
maps.  The  distinction  between  observational  noise  and  dynamic  noise  is  also  important  for  con¬ 
tinuous  time  dynamics,  typically  modeled  by  differential  equations. 
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2.2  Interpretations  of  linear  systems 


To  facilitate  the  interpretation  of  state  space  models  (introduced  in  Section  3),  we  first  review 
autoregressive  processes  without  observational  noise,  and  characterize  them  from  several  perspec¬ 
tives.  A  simple  way  of  generating  a  time  series  is  through  an  autoregressive  (AR)  process  of  order  p, 
AR[p]  (Yule  1927,  Priestley  1981,  Oppenheim  and  Schafer  1989) 

P 

x{t)  =  ^  ai  x{t  —  i)  -}-  e{t)  ,  (1) 

i=l 

where  e(t)  denotes  an  uncorrelated  Gaussian  distributed  random  variable  with  mean  zero  and 
constant  variance  Through  the  eyes  of  a  physicist,  such  a  process  can  be  interpreted 

as  a  combination  of  relaxators  and  damped  oscillators  (Honerkamp  1993).  The  simplest  case  is  an 
AR[1]  process 

x[t)  —  ax{t  —  1)  -I-  e{t)  .  (2) 


It  can  be  characterized  in  the  time  domain  as  a  relaxator  by  an  exponentially  decaying  impulse 
response,  proportional  to  exp(— t/r),  with  the  relaxation  time 

1 


T  =  - 


logo 


(3) 


After  this  time,  the  amplitude  of  an  impulse  will  have  decayed  to  1/e  or  37%  of  its  initial  value. 

In  the  frequency  domain,  an  AR  process  can  be  interpreted  as  a  filter  responding  to  white 
noise.  The  power  spectrum  of  an  AR[1]  process  drops  off  with 


Sioj)  = 


(4) 


|1  _ ae"*“P  l-t- _ 2 cosu 
For  an  AR[2]  process,  there  are  two  qualitatively  different  cases,  depending  on  the  values  of  the 
parameters.  We  can  always  rewrite  a  single  AR[2]  model  as  a  set  of  two  AR[1]  models  using  the 

transformation  , 

A=(“‘  1  .  (5) 


Its  eigenvalues  _ 

characterize  the  behavior  of  the  AR[2]  process.  If  the  eigenvalues  are  real  (af/4  -t-  02  ^  0),  the 
AR[2]  process  can  be  characterized  as  the  superposition  of  two  relaxators,  and  the  spectrum  drops 
off  monotonically  with  increasing  firequencies,  again  with  decay  constants 


ri  = 


1 

logAi 


(i  =  l,2) 


(7) 


If  the  eigenvalues  axe  complex,  the  AR[2]  process  describes  a  resonance,  corresponding  to  a  hump 
in  the  spectrum.^  In  both  cases,  the  spectrum  is  given  by 

"  |l-aie-“-a2e-2*“P  '  ^  ^ 


iFor  a  damped  oscillator  (the  case  of  complex  eigenvalues),  the  parameters  can  be  expressed  through  the  char- 
acteristic  period  T  and  the  relaxation  time  r  as 

ai  =  2cos 

03  =  -exp(-2/T) 
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By  increasing  the  model  order,  an  AR[3]  process  can  combine  a  relaxator  with  an  oscillator,  and 
an  AR[4]  process  can  describe  two  oscillators,  etc. 

Despite  the  simplicity  and  multiple  interpretability  of  AR  models,  not  all  processes  in  the  world 
are  linear  autoregressive.  Examples  of  generalizations  without  hidden  states  consist  of  including 
past  q  driving  noise  terms  in  the  dynamics,  yielding  an  autoregressive  moving  average  ARMAjp,  q] 
processes,^  as  well  as  including  nonlinearities.^  Here  we  extend  autoregressive  models  in  a  different 
direction,  by  allowing  for  a  hidden  state.^  The  next  section  introduces  the  notation  and  gives  the 
formalism  of  state  space  modeling. 


3  Formalism  of  linear  state  space  models  (LSSM) 

In  Eq.  (1)  the  x{t)  served  two  roles:  it  was  the  variable  that  was  observed,  and  it  was  the  variable 
in  which  the  dynamics  was  expressed.  However,  there  are  processes  where  the  dynamics  cannot 
be  observed  directly  because  it  is  masked  by  observational  noise.  Thus,  no  direct  map  exists  from 
the  observed  data  to  the  state.  This  requires  the  notion  of  a  hidden  state.  In  terms  of  notation,  we 
keep  the  letter  x  as  the  variable  that  contains  the  dynamics,  and  use  y{t)  for  the  observed  variable. 
The  state,  characterized  by  the  vector  x{t),  captures  all  the  information  needed  to  characterize  the 
system  at  time  t. 

The  key  to  state  space  modeling  is  to  split  the  noise  into  two  parts: 


•  dynamic  noise  €(i)  that  drives  the  evolution  of  the  hidden  state,  and 

•  observational  noise  r]{t)  that  is  a  non-explainable  additive  contribution  to  the  measured  y{t). 

These  contributions  have  been  discussed  in  intuitive  terms  in  Section  2.1.  Their  formal  role 
can  be  seen  by  observing  how  they  enter  the  two  equations  that  describe  a  linear  state  space  model 
(LSSM): 


x{t)  =  Af(^-l)  +  €(^),  e(i)€A/'(0,Q)  (9) 

y{t)  =  Cx{t)+7]{t),  7]{t)  e  J^{0,R)  .  (10) 

Eq.  (9)  describes  the  dynamics.  Eq.  (10)  maps  the  dynamics  to  the  observation  and  includes  the 
observational  noise  7]{t). 

As  in  the  case  of  the  observable  linear  autoregressive  model,  discussed  in  Section  2.2,  describing 
the  process  via  physical  quantities  can  yield  important  insights.  The  spectrum  of  a  LSSM  is  given 

2  While  for  theoretical  reasons  ARMA[p,p  -  1]  should  be  preferred  to  AR[p]  processes  for  modeling  of  sampled 
continuous-time  processes  (Phadke  and  Wu  1974),  we  find  that  in  practice,  differences  in  the  results  are  small. 

^The  linear  mapping  given  by  Eq.  (1)  can  be  generalized  to  become  a  nonlinear  mapping.  Note  that  this  is  fully 
within  the  autoregressive  framework  and  amounts  to  simple  regression.  Nonlinear  approaches  include  radial  basis 
functions  (Casdagli  1989,  Moody  and  Darken  1989,  Poggio  and  Girosi  1990),  neural  networks  (Lapedes  and  Farber 
1987,  Weigend,  Huberman  and  Rumelhart  1990),  and  nonparametric  kernel  methods  (Tjostheim  and  Auestad  1994). 

"^This  article  explores  the  idea  of  a  continuoits  hidden  state,  characterized  by  a  scalar  x(t)  or  a  vector  x{i).  The 
dynamics  is  expressed  in  terms  of  that  unobserved  state,  and  the  state  is  subsequently  mapped  to  the  (conditional 
expectation  of  the)  observed  quantity.  In  contrast,  Hidden  Markov  models  (Rabiner  1989,  Fraser  and  Dimitriadis 
1994,  Hamilton  1994,  Bengio  and  Frasconi  1995,  Shi  and  Weigend  1997)  assume  the  hidden  state  to  be  discretei 
for  each  of  these  hidden  states,  there  is  an  “agent”  or  “expert”  (e.g.,  expressed  as  an  autoregressive  model)  that 
generates  the  next  data  point.  This  introduces  a  second  level  of  dynamics  that  is  described  by  the  transitions 
between  the  hidden  states.  This  level  of  dynamics  is  absent  in  a  pure  autoregressive  framework. 
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by 


5H=C(l-Ae-’")-^Q((l-Ae‘“)-^)^C^  +  i?  .  (11) 

The  superscript  (O^  denotes  transposition.  The  spectra  of  AR  processes,  Eq.  (8),  are  a  subset 
of  Eq.  (11).  Note  that  LSSM  spectra  include  shapes  that  ceuanot  be  generated  by  AR  processes. 
An  important  example  of  such  a  shape  is  a  spectrum  where  for  low  frequencies  the  power  drops 
similarly  to  an  AR[1]  process  (see  Eq.  (4)),  but  for  higher  frequencies  the  power  remains  constant 
and  does  not  continue  to  fall,  as  an  AR  model  would  require  it  to.  This  can  be  interpreted  as 
a  low-frequency  process  whose  spectral  energy  decreases  as  the  frequency  increases,  until  it  is 
masked  by  a  noise  floor  of  a  noise  source  with  a  flat  spectrum;  This  low-frequency  signal  above  a 
flat  noise  floor  is  the  crucial  spectral  signature  of  a  LSSM  that  cannot  be  emulated  by  an  ordinary 
autoregressive  model. 

While  parameter  estimation  in  AR  models  is  well  established  (e.g.,  by  the  Burg  or  the  Durbin- 
Levinson  algorithms),  it  is  more  cumbersome  in  the  case  of  state  space  models  .  A  standard 
approach  uses  the  expectation  maximization  (EM)  algorithm  (Dempster,  Laird  and  Rubin  1977), 
a  general  iterative  procedure  for  estimating  parameters  for  models  with  hidden  variables.  In  the 
El-step,  it  is  assumed  that  the  parameters  of  the  model  are  known,  and  the  hidden  variables  are 
estimated.  In  the  M-step,  the  estimates  of  the  hidden  variables  are  taken  literally  and  the  values 
of  the  parameters  are  adjusted.  This  approach  was  first  applied  to  LSSM  by  Shumway  and  Stoffer 
(1982). 

Specifically  for  the  case  of  the  LSSM,  the  first  E-step  starts  from  the  initial  values  of  the 
parameters  A,Q,C,J?,  and  estimates  the  hidden  dynamic  variable  f(i)  using  a  Kalman  filter. 
With  the  following  definitions 

•  :=  the  predicted  value  of  a  quantity  z{t)  based  on  the  data  i/(l), 

•  “  the  covariance  matrix  of  the  estimated  x{t),  and 

•  A{|t/  :=  the  variance  of  the  prediction  errors  {y{t)  —  yt\t')} 

the  equations  for  the  Kalman  filter  are  (Kalman  1960,  Gelb  1974,  Sorenson  1985,  Harvey  1989,  Aoki 
1990,  Bomhoff  1994,  Hamilton  1994,  Mendel  1995): 


12 

t\t-i 

= 

(12) 

A 

iji-l 

+  R 

(13) 

K 

(14) 

= 

(15) 

z= 

Axt- 

i\t-i 

(16) 

yt\t-i 

= 

-1 

(17) 

St\t 

= 

+  K{y{t) 

(18) 

There  is  a  crucial  difference  between  the  first  four  equations  and  the  last  three.  The  first  four 
equations,  Eq.  (12-15),  do  not  contain  the  data,  they  only  describe  relations  between  the  param¬ 
eters  A,Q,C,R,n,A,  and  K.  Their  purpose  is  to  find  the  value  of  K  (the  Kalman  gain)  that 
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subsequently  enters  Eq.  (18).  K  gives  the  appropriate  weight  to  the  added  term  originating  in  the 
error  between  the  actual  observation  y{t)  eind  prediction 

For  true  prediction,  i.e.,  when  y{t)  has  not  yet  been  observed,  Eq.  (16)  has  to  be  used  for  the 
unobserved  state  variable,  and  Eq.  (17)  for  the  observable.  For  model  parameter  estimation,  on 
the  other  hand,  the  entire  training  data  can  be  used,  and  an  improved  estimate  of  Xt\N  can  be 
obtained  by  the  following  three  equations  (Harvey  1989): 


B 

^t\N 


(19) 

®t|t  +  B(xt+i|Ar  “  Aft|t) 

(20) 

(21) 

This  concludes  the  E-step. 

In  the  subsequent  M-step,  the  parameters  A,  Q,  C,  R  are  updated;  an  example  of  the  derivation 
of  the  equations  can  be  found  in  Honerkamp  (1993).  The  iterative  model  fitting  process  ends  when 
a  convergence  criterion  is  met.  This  concludes  the  description  of  how  the  model  parameters  are 
updated  in  the  M-step. 

Once  a  model  has  been  built,  its  quality  can  be  evaluated  by  several  different  criteria,  including: 


•  Predictive  accuracy.  True  out-of-sample  predictions  are  generated  using  Eq.  (17)  on  a  test 
set  that  comes  after  the  training  period.  The  accuracy  of  the  predictions  can  be  compared 
to  competing  models  by  different  evaluation  criteria,  such  as  squared  errors  or  robust  errors. 

•  Whiteness  of  the  prediction  errors.  The  model  should  explain  all  temporal  correlations 
in  the  data:  a  perfect  model  takes  the  signal  and  turns  it  into  white  noise.  Statistically,  the 
question  is  whether  we  can  reject  (at  a  certain  level  of  significance)  the  null  hypothesis  that 
the  residuals  are  uncorrelated.  Following  Brockwell  and  Davis  (1991),  we  use  a  Kolmogorov- 
Smirnov  test  to  determine  whether  the  periodogram  of  the  residuals  is  consistent  with  a  flat 
white  noise  spectrum. 

•  Generating  data  from  the  model.  The  distribution  of  a  certain  feature  can  be  derived 
from  realizations  of  the  model  and  compared  with  that  feature  as  directly  computed  from 
the  observed  data. 


For  linear  models,  two  additional  criteria  are  useful: 

•  Behavior  in  the  time  domain  (relaxation  times).  The  parameters  in  linear  models  are 
related  to  relaxation  times  of  the  corresponding  oscillators  and  relaxators.  When  the  relax¬ 
ation  times  are  too  small  (of  order  of  one  time  step),  they  usually  only  fit  noise,  indicates 
that  the  order  of  the  model  is  too  large. 

•  Behavior  in  the  frequency  domain  (spectrum).  The  spectrum  of  the  linear  process  can 
be  computed  from  the  parameters  of  the  estimated  models  through  Eq.  (11).  Since  the 
spectrum  of  the  model  should  correspond  to  the  expectation  of  the  periodogram  of  the  data, 
comparing  the  spectrum  to  the  periodogram  is  another  important  qualitative  criterion. 
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The  suggestions  listed  here  axe  just  some  of  the  useful  general  criteria  that  will  be  used  in  this 
article.  For  any  specific  problem,  there  are  additional,  more  specific  smoke  alarms  and  sanity 
checks. 

4  Applications  of  state  space  models  to  finance 

This  section  discusses  two  common  applications  of  state  space  models  in  financial  data,  and  com¬ 
pares  them  to  our  approach.  For  simplicity  of  notation,  this  discussion  is  written  for  the  case  of  a 
scalar  x(t): 

x{t)  =  ax{t  -  1)  -t-  e(t)  (22) 

y{t)  =  cx{t)+r]{t)  .  (23) 

The  dynamic  equation,  Eq.  (22),  is  characterized  by  the  single  AR[1]  coefficient  a;  e{t)  is  the 
djmamic  noise  that  drives  the  d3mamics.  The  observation  equation,  Eq.  (23),  maps  the  unobserved 
state  x{t)  to  the  observed  variable  by  scaling  it  with  c.  The  added  observational  noise,  does 
not  enter  the  dynamics. 

4.1  Smoothing 

The  first  approach  splits  the  variance  and  results  in  a  smoother  series.  It  can  be  interpreted  as  a 
method  for  trend  estimation.  Here,  parameter  a  is  not  estimated  from  the  data  to  characterize  the 
dynsunics  (as  in  our  approach),  but  rather  set  to  unity.  Without  loss  of  generality  we  can  also  set 
c  to  unity,  yielding 

x{t)-x{t-l)  =  e{t)  (24) 

y(t)  =  xit)+rj{t)  .  (25) 

Eq.  (24)  interprets  e{t)  as  the  first  difference  of  the  series.  Reducing  the  variance  of  e(f)  by  moving 
some  of  it  onto  7j(i)  results  in  a;(f)  as  a  smoothed  version  of  y{t).  The  variance  of  the  original  data 
y{t)  is  thus  decomposed  into  observational  noise,  and  a  smoother  signal,  x{t).  This  can  be 
expressed  in  a  Bayesian  framework  as  a  prior  on  the  smoothness  of  the  time  series,  as  discussed  by 
Kitagawa  and  Gersch  (1996).  Note  that  Eq.  (24)  resembles  Brownian  motion.  However,  it  is  not 
to  be  interpreted  that  way  here,  but  as  a  smoothing  constraint  for  the  undisturbed  signal  instead. 
The  smaller  e{t),  the  smoother  x{t). 

This  smoothing  approach  is  taken  in  most  state  space  applications  in  finance.  Bolland  and 
Connor  (1996)  add  to  this  approach  a  second  non-constant  part  that  is  a  linear  function  of  the 
difference  of  the  last  two  values  of  the  state.  This  is  effectively  adding  a  constraint  on  the  second 
differences  (curvatures)  of  x{i),  in  addition  to  the  first  differences.  Moody  and  Wu  (1996),  Moody 
and  Wu  (1997a),  and  Moody  and  Wu  (1997b)  use  two  variations  of  the  simple  smoothing  model 
with  a  =  1,  and  use  the  term  “true  price”  for  the  smoothed  version  of  the  observed  prices. 

4.2  Variable  parameter  AR  processes 

The  second  variation  of  the  state  space  model  also  uses  the  state  equation  to  model  a  slowly  varying 
quantity  as  in  Eq.  (24),  but  the  interpretation  of  the  observation  equation  changes  substantially. 
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(26) 


The  constant  c  from  Eq.  (23)  is 'replaced  by  y{t  —  1).  The  equation  then  becomes 

y{t)  =  x{t)y(t  -  1)  4-  T]{t)  , 

representing  an  AR[1]  process.  a;(t)  has  become  an  autoregressive  parameter  that  slowly  varies  with 
time,  and  the  former  observational  noise  7]{t)  now  acts  as  dynamic  noise  (Wells  1996),  whereas  we 
assume  the  parameters  that  characterize  the  system  are  constant  over  time. 

4.3  Modeling  noisy  linear  systems 

The  two  cases  above  do  not  do  justice  to  the  dynamic  structure  of  Eq.  (22).  In  contrast,  this  article 
focuses  on  estimating  the  full  hidden  dynamics  from  the  data.  This  allows  us  to  characterize  the 
process  as  a  linear  damped  system  of  relaxators  and  oscillators,  driven  by  dynamic  noise,  and 
observed  through  a  veil  of  added  observational  noise. 

In  the  econometric  literature,  stochastic  volatility  models  have  been  used  to  describe  the  dy¬ 
namic  structure  of  returns,  see  Shephard  (1996)  for  a  recent  review.  In  the  notation  of  the  present 
article,  a  stochastic  volatility  model  can  be  expressed  as 

x{t)  =  00-1-  aix{t  -  1)  -f-  e{t)  (27) 

y{t)  =  r](t)  exp(i(i))  .  (28) 

The  idea  behind  using  exp(a;(t))  is  to  model  the  skewed  distribution  of  squared  returns  found 
for  the  empirical  data.  Parameter  estimation  in  this  model  is  cumbersome  due  to  the  log-normal 
distribution  of  exp(a:(t)).  It  is  usually  based  on  the  generalized  method  of  moments,  quasi-likelihood 
estimation  or  Markov  chain  Monte  Carlo  methods.  In  contrast  to  stochastic  volatihty  models,  we 
apply  a  static  transformation  to  the  data  that  will  be  introduced  in  the  next  section  in  order  to 
make  the  distribution  of  squared  return  approximately  normal.  This  allows  us  to  use  as  standard 
maximum  likelihood  framework  for  the  parameter  estimation. 

5  Data 

This  article  reports  results  on  the  following  data  sets; 

•  High  frequency  DEM/USD  foreign  exchange  rates.®  We  began  with  eight  years  of  data 
(through  June  29,  1995)  spaced  apart  30  minutes  in  i?-time  (Theta-time).  We  dropped 
all  points  with  missing  values,  and  then  took  every  fourth  of  the  remaining  points  for  our 
analysis,  effectively  downsampling  to  two  hours  in  ?9-time.®  i?-time  removes  daily  and  weekly 
seasonality:  time  of  day  with  a  high  mean  volatility  are  expanded,  and  times  of  day  and 
weekends  with  low  volatility  are  contracted  (Dacorogna,  Gauvreau,  Muller,  Olsen  and  Pictet 
1996). 

®We  thank  Michel  Dacorogna  (Olsen  &  Associates,  Zurich)  for  the  high  frequency  DEM/USD  exchange  rate 
data. 

^Whether  half-hour  or  two-hour  intervals  in  i?-time  are  taken  does  not  change  the  results  reported  here,  since 
the  time  scale  of  the  dynamics  that  we  find  is  two  orders  of  magnitude  slower  than  the  sampling  interval.  However, 
if  we  were  to  have  changed  the  sampling  interval  by  a  larger  factor,  note  that  Brown  (1990)  shows  for  S  &  P  500 
Index  futures  that  the  estimated  (unconditional)  volatility  decreases  by  13%  as  the  sampling  interval  is  changed 
from  one  minute  to  one  hour. 
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Figure  1:  This  figure  displays  a  six-month  window  of  the  high  frequency  foreign  exchange  data, 
sampled  at  two  hour  intervals  in  t?-time.  The  top  panel  shows  the  prices,  the  middle  panel  shows 
the  relative  returns,  and  the  bottom  panel  shows  the  series  used  in  our  analysis,  i.e.,  after  applying 
the  logarithm  and  scaling  it  to  zero  mean  and  unit  variance. 

The  top  panel  of  Fig.  1  graphs  the  level  of  DEM/USD  for  the  first  half  of  1995.  Its  periodogram, 
shown  in  the  left  panel  of  Fig.  2,  drops  to  first  approximation  as  the  spectrum  of  a  random  walk 
whose  1/P  line  is  also  indicated.  (/  denotes  the  frequency.)  The  signature  of  observational  noise— 
a  noise  floor  masking  the  signal  at  high  frequencies — is  absent:  the  periodogram  continues  to  drop 
^We  thank  Mono  Yoda  (Nikko  Securities,  Tokyo)  for  the  Nikkei  225  stock  index  data. 

®The  Dow  Jones  Industrial  Average  data  set  is  described  in  LeBaxon  and  Weigend  (1997)  and  available  through 
uvw . stern . nyu . adu/'awoigend/Research. 
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to  the  highest  time  scale.  The  result  is  that  price  levels  p{t)  of  financial  instruments  do  not  exhibit 
significant  observational  noise;  all  the  “noise”  on  prices  is  dynamic,  i.e.,  it  re-enters  the  dynamic 
equation. 

The  central  panel  of  Fig.  1  shows  the  difference  of  the  logarithm  of  the  price  levels 

logp(t)  -  logp(t  - 1)  =  log  «  p(t-i)  •  (29) 

This  quantity  can  be  interpreted  as  the  logarithm  of  the  geometric  growths,  i.e.,  as  the  logarithm  of 
the  ratio  of  the  prices.  Using  the  fact  that  the  logarithm  Taylor  expands  around  1  as  loge  «  IH-  e, 
it  can  also  be  interpreted  as  the  returns  normalized  by  the  levels,  i.e.,  the  relative  returns.  Note  in 
the  central  panel  of  Fig.  1  that  the  width  of  the  “band”  varies  over  time;  regimes  with  larger  shocks 
(positive  or  negative)  alternate  with  regimes  with  smaller  widths.  The  corresponding  periodogram 
of  the  relative  returns  is  shown  in  the  right  panel  of  Fig.  2.  Note  that  it  is  essentially  flat:  the 
subsequent  returns  on  the  two-hour  time  scale  in  i?-time  appear  to  be  (linearly)  uncorrelated. 

To  exploit  this  observed  structure  in  the  absolute  values  of  the  relative  returns,  we  square  the 
relative  returns,  i.e.,  ignoring  their  signs.  The  distribution  of  the  squared  returns  is  very  skewed. 
To  make  it  less  skewed,  we  take  their  logarithm, 

y(l)=Iog[logj|(^]’  .  (30) 

The  logarithm  of  the  squared  relative  returns,  y{t),  is  shown  for  the  DEM/USD  data  in  the  the 
bottom  panel  of  Fig.  1. 

The  squared  relative  returns  can  be  interpreted  as  independent  realizations  of  a  random  variable 
with  a  slowly  changing  mean.  If  the  relative  returns  loZP{t)/p(t  -  1)  were  normally  distributed 
with  unit  variance,  their  squares  would  follow  a  distribution.  The  variance  of  this  x^  distribution 
is  twice  its  mean,  implying  that  the  realizations  are  very  noisy  indeed!  This  is  the  source  of  the 
observational  noise  for  volatility.  On  empirical  data,  it  is  well  known  that  the  relative  returns 
\ogp{t)lp{t  -  1)  are  not  normally  distributed,  but  have  fatter  tails.  However,  the  spirit  of  the 
explanation  for  the  observational  noise  still  applies;  see  also  Diebold  and  Lopez  (1995). 

Fig.  3  shows  this  eflFect.  The  periodogram  of  the  data  contains  most  of  its  power  at  low 
frequencies.  Subsequently,  as  the  frequency  increases,  it  begins  to  drop.  Finally,  it  flattens  out 
as  the  signal  gets  masked  by  this  “observational  noise,”  stemming  from  the  noisy  realizations  of 
the  slowly  changing  means  of  the  squared  returns.  Note  the  absence  of  a  daily  or  weekly  peak  in 
this  periodogram:  while  present  for  data  in  chronological  time,  it  has  been  successfully  removed  by 
Olsen’s  projection  of  the  data  onto  T^-time.  This  periodogram  is  similar  to  figures  in  Schnidrig  and 
Wiirtz  (1995)  and  in  Andersen  and  Bollerslev  (1997).  However,  neither  of  these  papers  interpret 
the  signature  as  evidence  for  observational  noise,  nor  do  they  use  a  state  space  model  to  explain 
the  data. 

The  key  features  of  the  periodogram— a  drop  over  many  orders  of  magnitude  for  price  levels, 
a  roughly  constant  level  for  returns,  and  a  low  frequency  signal  disappearing  into  observational 
noise  at  higher  frequencies  for  squared  returns — hold  for  all  the  financial  data  sets  we  analyzed, 
including  six  other  currencies  on  different  time  scales,  as  well  as  several  stock  indices.  The  next 
section  gives  detailed  results  for  DEM/USD  and  Nikkei  225,  as  well  as  brief  results  for  the  Dow 
Jones  industrial  index. 


6  Results 


Table  1  summarizes  the  results  for  the  high  frequency  DEM/USD  data,  comparing  linear  state 
space  models  with  ordinary  AR  models.  The  linezir  state  space  models  differ  crucially  from  the 
AR  models  in  the  decay  times  r:  while  the  decay  times  of  the  state  space  models  are  significant, 
they  are  negligible  for  the  AR  models  where  the  processes  typically  decay  within  one  time  step. 
Since  the  state  space  model  is  fitted  to  y{t)  as  defined  in  Eq.  (30),  the  decay  times  characterize 
when  the  logarithm  of  the  squared  relative  returns  has  decayed  to  37%  of  its  initial  value. 

For  first  order  models  describing  a  single  relaxator,  there  is  a  huge  difference  in  decay  time 
Periodogram  of  Prices  Periodogram  of  Relative  Returns 


Figure  2:  Periodogram  of  the  DEM/USD  prices  (left),  and  of  the  relative  returns  (right).  Expressed 
in  1/time,  the  leftmost  points  correspond  to  1/(8  years),  1/(4  years),  l/(2.6  years),  1/(2  years). 
To  guide  the  eye,  we  also  plotted  the  1//^  drop  in  spectral  power  of  a  random  walk  over  six  orders 
of  magnitude.  The  periodogram  of  the  returns  on  the  right  hand  side  is  essentially  flat.  Neither 
the  prices  nor  the  returns  indicate  the  presence  of  observational  noise,  in  contrast  to  Fig.  3. 
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between  156  time  steps  for  the  LSSM  in  contrast  to  an  insignificant  0.45  time  steps  for  the  AR 
process.  The  eigenvalues  of  the  second  order  models,  given  by  Eq.  (6),  turn  out  to  be  real; 
the  process  thus  corresponds  to  the  superposition  of  two  relaxators.  The  slower  one  of  the  two 
relaxators  settles  to  around  240  of  the  2-hour  steps  and  corresponds  to  20  days,  whereas  the  slower 
AR  relaxator  still  decays  in  a  single  time  step.  Using  third  and  fourth  order,  oscillators  emerge 
whose  resonance  frequencies  1/T  correspond  to  about  one  day.  They  might  indicate  a  tiny  amount 
of  periodicity  left  after  the  transformation  of  the  raw  data  to  i?-time,  but  they  do  not  contribute 
significantly  to  the  dynamics  since  their  relaxation  times  are  of  the  order  of  a  few  time  steps  only. 


Figure  3:  Periodogram  (“-f”)  of  the  DEM/USD  exchange  rates,  and  spectra  of  the  estimated  state 
space  models  (SSM)  of  order  one  (dashed  line)  and  higher  orders  (solid  lines). 
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Models 

DEM/USD 

r  (decay  times) 

1  step  =  2  hours 

AR  coefficients 

Prob 

white 

noise 

%MS 

LSSM{1) 

156 

0.994 

0 

0.960 

LSSM{2) 

240,  1.09 

1 

0.996  0  \ 

0  0.399  ) 

0.72 

0.957 

LSSM{Z) 

236 

[T  =  17  T  =  1.6] 

1 

0.966  0  0  \ 

0  0.507  -0.188 

0  0.188  0.507  / 

0.57 

0.957 

LSSM{4) 

243, 1.1 

[T  =  8.5  r  =  10] 

1 

0.996  0  0  0 

0  0.411  0  0 

0  0  0.666  -0.612 

0  0  0.612  0.666  yi 

1 

0.70 

0.957 

0.45 

0.107 

0 

0.988 

0.85,  0.64 

0.100  0.066 

0 

0.984 

AR{Z) 

1.25  ■ 

[T  =  6.5  r  =  0.86] 

0.097  0.062  0.044 

2e-6 

0.982 

0.095  0.058  0.039  0.049 

7e-4 

0.979 

Table  1:  Results  for  the  volatilities  of  the  DEM/USD  exchange  rates.  While  linear  state  space 
models  (LSSM)  of  order  two  and  above  fit  the  data  well,  ordinary  AR  models  cannot  explain  the 
structure  of  the  data. 


The  decay  constants  presented  here  are  defined  for  the  logarithm  of  the  squared  relative  re¬ 
turns.  Nonlinear  transformations  do  not  allow  for  an  amplitude-independent  interpretations  of 
decay  times  in  general.  However,  fitting  state  space  models  directly  to  the  absolute  or  squared 
relative  returns  (without  taking  the  logarithm)  yields  similar  decay  constants.  This  implies  that 
our  characterization  also  hold  for  stochastic  volatility  models. 

The  fourth  column  in  Table  1  shows  that  the  residuals  of  the  state  space  model  of  order 
one  are  not  consistent  with  white  noise,  implying  that  a  first  order  LSSM  does  not  describe  the 
data  adequately.  However,  all  higher  order  LSSMs  produce  residuals  consistent  with  white  noise 
at  a  significance  level  of  0.05  for  the  Kolmogorov-Smirnov  test  on  the  whiteness  of  the  residuals 
(Brockwell  and  Davis  1991).  None  of  the  residuals  of  the  AR  models  are  consistent  with  white 
noise.  This  is  another  indication  that  AR  models  axe  not  an  adequate  model  class  for  volatility. 

The  last  column  gives  the  normalized  mean  squared  error,  EnmS)  between  the  observed  y{t) 
and  the  predictions  obtained  via  Eq.  (17).  Whereas  for  LSSM,  the  error  drops  quickly  to  a  constant 
level  of  0.957  at  order  2,  it  decreases  for  AR  models  at  a  much  slower  rate,  and  also  remains  at  a 
higher  level.  In  an  AR(IO)  model,  for  example,  Enms  takes  the  value  of  0.973,  still  significantly 
above  the  value  of  the  second  order  LSSM. 

We  now  turn  to  the  power  spectra.  The  curves  in  Fig.  3  are  the  power  spectra  of  the  state 
space  models.®  They  axe  computed  using  Eq.  (11).  There  is  a  clear  difference  between  the  first 
order  spectrum  and  the  higher  order  spectra.  The  higher  orders  (^  2)  are  very  similar,  indicating 
that  the  second  order  state  space  model  is  indeed  sufficient.  The  spectra  of  the  state  space  models 

®The  spectra  and  the  periodogram  are  normalized.  For  the  lowest  200  frequencies,  all  periodogram  points  are 
plotted.  Above  this  frequency,  they  are  logarithmically  thinned  out  for  the  sole  reason  to  keep  the  files  reasonably 
small  for  the  on-line  version.  The  visual  impression  in  the  printed  version  does  not  change. 
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correspond  well  to  the  periodogram  of  the  data.  Note  that  the  spectra  are  not  obtained  by  some 
direct  smoothing  of  the  periodogram  in  frequency  space,  but  axe  the  spectra  of  the  state  space 
models  which  were  fitted  in  the  time  domain. 


Models 

Nikkei225 

r  (decay  times) 

1  time  step  =  1  day 

Prob  of 
white  noise 

%MS 

LSSMil) 

63.1 

0.004 

0.906 

LSSmI"!) 

81.8, 1.45 

0.56 

0.905 

LSSM{Z) 

81.2 

[T  =  8.7  r  =  6.9] 

0.64 

0.905 

LSSM(A) 

81.7,1.46 
[T  =  8.4  T  =  10] 

0.57 

0.905 

0.54 

0 

0.975 

AR{2) 

1.20,0.82 

0 

0.959 

AR{Z) 

1.85 

[T  =  6.6  r  =  1.05] 

7e-7 

0.951 

AR{A)  . 

2.93, 1.59 

[T  =  4.15  r  =  1.61] 

0.002 

0.940 

Table  2:  Results  for  the  volatilities  of  the  Nikkei  225  stock  index.  While  linear  state  space  models 
of  order  two  and  above  fit  the  data  well,  ordinary  AR  models  cannot  explain  the  structure  of  the 
data. 

The  results  for  the  second  data  set,  the  logarithm  of  absolute  values  of  the  relative  changes 
of  the  daily  Nikkei  225  level,  are  summarized  in  Table  2.  The  key  point  is  the  large  decay  time 
of  about  3  1/2  months,  revealed  by  the  state  space  models  of  order  two  and  above,  as  well  as  the 
failure  of  AR  models,  very  similar  to  the  DEM/USD  data  set  discussed. 

The  third  data  set,  the  logarithm  of  absolute  values  of  the  relative  changes  of  the  daily  Dow 
Jones  Industrial  Index,  reveals  a  decay  time  of  117  days  or  about  5  months.  In  that  case,  a  one 
dimensional  hidden  state  already  generates  residuals  that  are  consistent  with  white  noise.  As  in 
the  other  two  examples,  no  ordinary  AR  model  in  the  observed  variable  explains  the  data.  This 
effect  will  be  clarified  in  the  next  section. 

7  Ignoring  observational  noise 

The  failure  of  AR  models  shown  in  the  previous  section  is  a  consequence  of  the  observational  noise 
that  is  present  in  the  volatility  data.  Whereas  linear  state  space  models  include  the  observational 
noise  explicitly  in  the  model,  autoregressive  models  assume  that  the  data  is  free  from  observational 
noise.  We  use  a  simple  first  order  process  to  demonstrate  the  consequences  of  ignoring  observational 
noise  on  the  autoregressive  parameter. 

In  an  AR[1]  model,  x{t)  =  ax{t  -  1)  H-  e(t),  the  parameter  a  can  be  estimated  without  bias  as 

^  x{t  ~  /oi'v 

If,  however,  the  dynamics  is  covered  by  observational  noise 

yit)  =  x{t) +  T]{t),  7?~A/'(0,iI)  ,  (32) 
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the  expected  value  (denoted  by  <  •  >)  of  2,  estimated  in  analogy  to  Eq.  (31)  from  y{t),  now  becomes 


<  yjt  - 1)  yj^)  >  _ 

<yit-i)  yit-i)>  i  +  R/<x{t)^> 


(33) 


Thus,  the  larger  the  variance  R  of  the  observational  noise,  the  worse  the  parameter  a  will  be 
underestimated.  This  effect  is  known  from  linear  regression  as  the  problem  of  errors-in-variables 
(Puller  1987).  It  was  first  mentioned  in  time  series  context  by  Kostelich  (1992),  see  also  Konig  and 
Timmer  (1997).  The  underestimation  of  the  functional  relation  between  past  and  present  values 
carries  over  to  more  general  models,  including  nonlinear  models  (Carroll,  Ruppert  and  Stefanski 
1995,  Weigend,  Zimmennann  and  Neuneier  1996). 


8  Summary  and  Applications 

This  article  showed  the  important  distinction  between  observational  and  dynamic  noise.  When  ob- 
servational  noise  is  present,  an  autoregressive  approach  cannot  model  the  data  adequately — a  state 
space  approach  is  needed  to  capture  the  hidden  dynamics.  In  finance,  neither  prices  nor  returns 
tend  to  have  observational  noise.  However,  volatilities  do  exhibit  signature  of  observational  noise 
in  the  periodogram:  for  low  frequencies,  there  is  structure  above  the  noise  fioor  of  observational 
noise. 

We  showed  on  three  representative  financial  data  sets  that  a  linear  state  space  model  with  full 
dynamics  can  describe  volatilities  well.  We  also  showed  that  the  resulting  models  can  be  nicely 
interpreted,  both  from  the  perspective  of  physics  as  a  superposition  of  two  simple  relaxators,  and 
from  the  perspective  of  finance  as  volatility  clustering  with  a  decay  time  of  about  three  weeks  (for 
DEM/USD),  3  1/2  months  (for  Nikkei  225),  and  5  months  (for  Dow  Jones  Industrial  Average). 
These  results  axe  in  strong  contrast  to  AR  models  that  ignore  observational  noise  and  consequently 
have  a  bias  toward  too  small  coefficients,  as  shown  in  Section  7.  The  more  promising  modeling 
approach  using  state  space  models  over  AR  models  for  volatility  suggests  several  applications  in 
financial  markets,  including 

•  Estimating  risk.  Knowing  the  evolution  of  the  volatility  is  important  for  determining  the 
risk  associated  with  a  position  on  a  financial  instrument:  the  volatility  can  be  interpreted  as 
the  conditional  standard  deviation  of  the  returns. 

•  Pricing  derivative  securities.  Using  financial  theory,  discrepancies  between  the  predicted 
volatility  and  the  implied  volatility  can  be  translated  into  mispricings,  which  can  in  turn  be 
exploited  in  trading. 

•  Information  for  regime  switching  models.  The  predicted  volatility  can  be  an  important 
input  for  trading  models  based  on  the  “gated  experts”  architecture  (Weigend,  Mangeas  and 
Srivastava  1995).  In  this  case,  the  hidden  state  is  offered  as  an  additional  input  to  the  gate 
to  help  determine  the  current  region. 

In  summary,  we  discussed  the  signature  of  observational  noise  in  the  frequency  domain  and 
showed  on  three  data  sets  that  volatilities  exhibit  that  signature,  but  not  the  prices  or  returns.  We 
showed  that  allowing  for  a  hidden  process  with  two  or  more  degrees  of  freedom,  and  modeling  the 
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full  dynamics  of  this  process,  gives  interpretable  results  yielding  residuals  consistent  with  white 
noise.  We  are  currently  evaluating  on  several  time  horizons  the  performance  for  true  volatility  pre¬ 
dictions  of  state  space  models  in  comparison  to  historic  data  (Figlewski  1994),  GARCH  (Bollerslev 
et  al.  1995),  and  stochastic  volatility  models  (Shephard  1996). 
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Data  Splitting  on  Financial  Time  Series 

Blake  LeBaron  and  Andreas  S.  Weigend 


Abstract — This  article  exposes  problems  of  the  commonly 
used  technique  of  splitting  the  available  data  into  training, 
validation,  and  test  sets  that  are  held  fixed,  warns  about 
drawing  too  strong  conclusions  from  such  static  splits,  and 
shows  potential  pitfalls  of  ignoring  variability  across  splits. 
Using  a  bootstrap  or  resampling  method,  we  compare  the 
uncertainty  in  the  solution  stemming  from  the  data  split¬ 
ting  with  neural  network  specific  uncertainties  (parameter 
initialization,  choice  of  number  of  hidden  units,  etc.).  We 
present  two  results  on  data  from  the  New  York  Stock  Ex¬ 
change.  First,  the  variation  due  to  different  resamplings  is 
significantly  larger  than  the  variation  due  to  different  net¬ 
work  conditions.  This  result  implies  that  it  is  important  to 
not  over- interpret  a  model  (or  an  ensemble  of  models)  es¬ 
timated  on  one  specific  split  of  the  data.  Second,  on  each 
split,  the  neural  network  solution  with  early  stopping  is  very 
close  to  a  linear  model;  no  significant  nonlinearities  are  ex¬ 
tracted. 

Keywords — Model  evaluation.  Model  uncertainty.  Boot¬ 
strap.  Resampling.  Financial  forecasting.  Time  series  pre¬ 
diction.  Linear  bias  of  early  stopping.  Superposition  of 
forecasts.  Model  merging. 

Data — Dow  Jones  Industrial  Average,  1962-1987.  Volume 
from  New  York  Stock  Exchange,  1962-1987.  The  data  used 
in  this  article  is  available  from  the  web  sites  of  the  authors. 


1.  Introduction 

Training  a  network  on  a  time  series  is  not  hard,  but  once 
we  have  a  network,  how  much  can  we  trust  the  forecasts  for 
truly  new  data?  On  the  one  hand,  if  the  time  series  is  fairly 
long  (above  a  few  thousand  points),  and  if  it  is  fairly  clean 
(noise  of  less  than  one  percent  of  the  signal),  the  evaluation 
of  a  model  is  relatively  easy,  since  only  very  few  functions 
will  fit  some  held-back  data  very  well.  This  regime  can  be 
described  as  a  “right-with-probability-(l  —  e)-regime.”  On 
the  other  hand,  for  very  noisy  and/or  very  short  time  series, 
one  can  only  hope  to  be  right  on  new  data  with  a  proba¬ 
bility  of  (0.5  -he).  An  example  would  be  the  forecast  of  the 
direction  of  a  stock  price  movement.  It  is  well  known  that 
random  predictions,  or  random  trading  strategies,  can  yield 
deceptively  long  sequences  of  good  predictions  or  profitable 
trades.  In  such  noisy  problems,  many  functions  will  be  in¬ 
distinguishable  in  their  forecasting  quality.  When  connec- 
tionist  techniques  are  used,  additional  choices  (such  as  the 
architecture,  training  procedure,  and  the  random  initial¬ 
ization  of  the  network)  make  the  evaluation  even  harder. 
Evaluating  a  model  for  noisy  time  series  can  be  more  work 
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than  estimating  the  parameters. 

A  standard  procedure  for  evaluating  the  performance  of 
a  model  is  to  split  the  data  into  one  training  set  (used  for 
the  parameter  estimation,  e.g.,  through  gradient  descent  or 
second  order  methods),  one  validation  set  (used  to  deter¬ 
mine  the  stopping  point  before  overfitting  occurs,  and/or 
used  to  set  additional  parameters  or  hyperpaxameters,  such 
as  the  importance  given  to  penalize  model  complexity), 
and  one  or  more  test  sets.  This  procedure  has  been  used 
for  many  years  in  the  connectionist  community,  see  e.g., 
Weigend  et  al.  (1990).  Our  more  recent  experience  has 
found  this  approach,  along  with  conclusions  drawn  from 
it,  to  be  very  sensitive  to  the  specific  splitting  of  the  data. 
Therefore,  usual  tests  of  forecast  reliability  can  easily  be 
overly  optimistic. 

This  article  addresses  these  problems  with  a  bootstrap 
method.  The  approach  we  present  combines  the  purity  of 
splitting  the  data  into  three  disjoint  sets  with  the  power  of 
a  resampling  procedure,  giving  a  better  statistical  picture 
of  forecast  variability,  including  the  ability  to  estimate  the 
effect  of  the  randomness  of  the  splits  of  the  data  vs.  the 
randomness  of  initial  conditions  of  the  network. 

This  is  not  the  first  article  that  uses  the  bootstrap  in 
a  connectionist  context.  Weigend  et  al.  (1992)  used  the 
bootstrapping  of  residuals  to  evaluate  the  forecasting  power 
of  a  neural  net  for  exchange  rate  forecasts,  and  Connor 
(1993)  also  bootstrapped  residuals  to  obtain  error  bars  for 
the  iterated  time  series  predictions.  The  goals  were  differ¬ 
ent  from  the  goal  of  the  work  reported  here.  In  this  article 
we  resample  pairs  which  will  be  clarified  in  Section  II- A. 
Resampling  pairs  was  first  suggested  by  Efron  (1982),  and 
first  used  in  the  connectionist  community  by  Paass  (1993) 
on  the  example  of  noisy  exclusive  OR.  Tibshirani  (1996) 
applied  the  bootstrap  machinery  to  networks  in  a  cross- 
sectional  context.  However,  none  of  these  articles  evaluate 
the  effect  of  using  the  common,  simple,  static  sample  split 
on  the  performance  reliability. 

To  demonstrate  our  method,  we  wanted  to  use  a  data 
set  that  lies  somewhere  between  simple  noise-free  function 
fitting,  and  a  sequence  of  true  random  numbers  where  no 
model  has  a  chance.  We  picked  the  daily  trading  volume^ 
on  the  New  York  Stock  Exchange,  where  predictions  can 
explain  about  half  of  the  variance.  Section  II  of  this  ar¬ 
ticle  describes  the  method  and  the  data  set,  Section  III 
presents  the  empirical  results  of  the  study,  Section  IV  dis¬ 
cusses  other  sources  of  uncertainty  not  captured  by  the 
bootstrap,  and  Section  V  draws  some  conclusions. 

^Although  forecasting  prices  is  a  potentially  more  lucrative  target, 
volume  actually  is  interesting  to  the  economist  whose  goal  is  to  un¬ 
derstand  how  markets  function. 
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IL  Experimental  Design 
A.  Bootstrapping  Methodology 

Randomness  enters  naturally  in  two  ways  in  neural  net¬ 
work  modeling:  in  the  splitting  of  the  data,  and  in  choices 
about  the  network  initialization,  architecture,  and  train¬ 
ing.  A  standard  procedure  for  finding  a  good  network  is  to 
split  the  patterns  derived  from  a  time  series  into  three  sets: 
training,  validation,  and  test  sets.  The  training  set  is  used 
for  parameter  estimation  (in  simple  backpropagation,  by 
updating  the  parameter  by  gradient  descent  on  some  cost 
function).  In  order  to  avoid  overfitting,  a  common  proce¬ 
dure  is  to  use  a  network  sufficiently  large  for  the  task,  to 
monitor  (during  training)  the  performance  on  the  separate 
validation  set,  and  finally  to  choose  the  network  that  corre¬ 
sponds  to  the  minimum  on  the  validation  set,  and  employ 
it  for  future  purposes  such  as  the  evaluation  on  the  test 
set.  These  sets  have  no  patterns  in  common.  The  usual 
procedure  fixes  these  sets.  As  many  statistical  quantities 
as  desired  can  be  estimated  in  the  test  set,  but  this  leaves 
one  question  wide  open:  What  is  the  variation  in  perfor¬ 
mance  as  we  vary  training,  validation,  and  test  sets?  This 
is  an  important  question  since  real  world  problems  don^t 
come  with  a  tag  at  each  pattern  saying  how  it  should  be 
used!  Also,  if  we  were  to  only  train  one  network  on  such  a 
split,  this  would  not  tell  us  how  stable  the  performance  is 
with  respect  to  network  choices. 

Since  there  is  not  just  one  “best”  split  of  the  data  or  obvi¬ 
ous  choice  for  the  initial  weights  etc.,  we  will  vary  both  the 
data  partitions  and  network  parameters  in  order  to  find 
out  more  about  the  distributions  of  forecast  errors.  We 
use  a  computer  intensive  bootstrapping  method  to  evaluate 
the  performance,  reliability  and  robustness  of  the  connec- 
tionist  approach,  and  to  compare  it  with  linear  modeling. 
Bootstrapping  involves  generating  empirical  distributions 
for  statistics  of  interest  through  random  resampling.  We 
combine  bootstrapping  along  with  random  network  selec¬ 
tion  and  initialization. 

In  more  detail,  in  order  to  understand  the  impact  of 
the  splitting  and  network  choices,  we  draw  a  realization  of 
splits  and  network  conditions,  and  train  a  complete  model 
on  this  realization.  This  is  sometimes  called  bootstrap¬ 
ping  pairs  (Efron  &  Tibshirani,  1993),  since  the  input- 
output  pairs  or  patterns  remain  intact,  and  are  resampled 
as  full  patterns.  This  can  be  contrasted  with  training  one 
model  only,  and  resampling  the  errors  of  that  one  model 
to  obtain  a  distribution,  called  bootstrapping  residuals. 
The  latter  method  was  used  in  single-step  prediction  by 
Weigend  et  al.  (1992)  in  the  context  of  foreign  exchange 
rate  predictions.  One  model  was  built  on  one  split  of  the 
data.  Similarly,  in  an  application  to  load  forecasting,  Con¬ 
nor  (1993)  trained  one  single-step  prediction  network  on 
one  split  of  the  data,  then  resamples  from  the  empirical 
distribution  of  the  single-step  errors  and  adds  these  to  the 
inputs  in  order  to  obtain  estimates  of  the  errors  of  iter¬ 
ated  forecasts.  In  this  residuals  bootstrap,  the  residuals 
obtained  from  one  specific  model  are  used  in  rebuilding 
pairs  or  patterns  to  obtain  error  bars  reflecting  all  sources 


of  error,  including  model  misspecification.  In  contrast,  here 
we  are  interested  in  variation  due  to  sample  splits  rather 
than  error  bars.  Every  “run”  has  a  different  assignment 
between  the  sample  patterns  and  the  three  sets  which  thus 
are  different  for  each  run. 

In  the  example  used  in  this  article,  we  have  some  6200 
patterns,  each  made  up  of  a  few  past  values  of  a  number  of 
time  series  (for  details  of  how  the  patterns  are  constructed, 
see  Section  II-B  below).  We  first  build  the  test  set  by  ran¬ 
domly  picking  1500  patterns  with  replacement.  The  pat¬ 
terns  used  in  this  specific  test  set  are  then  removed  from  the 
pool.  From  the  remaining  patterns,  we  then  randomly  set 
aside  1500  patterns  as  the  validation  set  (these  are  picked 
without  replacement,  and  are  also  removed  from  the  pool). 
The  remaining  patterns  then  constitute  the  training  set.^ 
For  the  results  presented  in  the  article,  we  do  this  2523 
times,  training  a  network  each  time. 

We  use  fully  connected  feedforward  networks  with  one 
hidden  layer  of  tanh  units  and  a  linear  output  unit.  How¬ 
ever,  in  order  to  include  variations  over  reasonable  choices 
for  network  and  learning  parameters,  a  number  of  network 
characteristics  are  also  drawn  randomly  at  the  beginning 
of  each  run.^  The  cost  function  is  the  squared  difference 
between  the  network  output  and  the  target  (expressed  as 
the  log-transformed  volume,  detailed  in  the  next  section), 
summed  over  all  patterns  in  the  training  set.  Most  results 
are  given  in  terms  of  (1  -  R^),  i.e.,  one  minus  the  squared 
correlation  coefficient  between  forecast  and  target. 

B.  Data  Set 

We  use  daily  data  from  the  New  York  Stock  Ex¬ 
change  (NYSE)  from  December  3rd,  1962,  through  Septem¬ 
ber  16th,  1987,  corresponding  to  6230  days.^  Our  forecast¬ 
ing  goal  is  daily  total  trading  volume,  shown  in  Fig.  1.  We 
believe  that  this  series  has  two  interesting  features:  First, 
while  many  articles  have  tried  neural  network  approaches 
to  forecasting  prices,  few  have  attempted  forecasting  trad¬ 
ing  volume.  Second,  volume  differs  from  many  other  fi¬ 
nancial  series  in  that  it  contains  more  forecastable  struc¬ 
ture  than  typical  price  series.  We  use  the  daily  measure 
of  aggregate  turnover  on  the  NYSE  which  is  total  volume 

^There  is  no  deep  theoretical  justification  for  drawing  the  test  data 
with  replacement,  and  the  training  and  validation  set  effectively  with¬ 
out  replacement.  Our  motivation  was  to  stick  to  the  standard  rule  of 
sampling  with  replacement  for  the  test  set.  For  the  training  and  val¬ 
idation  sets,  we  did  not  allow  for  repeated  patterns  since  we  wanted 
the  linear  fit  comparison  to  be  estimated  on  each  non-test  data  point 
with  even  weight,  and  wanted  to  use  identical  sets  for  the  net  and  the 
linear  fit  in  each  run. 

^In  detail,  the  network  architecture  is  chosen  uniformly  over  2  to 
6  hidden  units.  The  learning  rate  is  chosen  uniformly  over  [1,20]  x 
10“"^,  no  momentum.  The  weight-range  w  of  the  initial  weights  is 
drawn  between  [0.25,2.5].  The  individual  weights  are  then  initialized 
randomly  from  a  uniform  distribution  over  [w/i,  —w/i]  where  i  is  the 
number  of  connections  coming  into  a  unit  (“fan-in”).  The  block-size 
(how  many  patterns  are  presented  until  the  weights  are  updated)  is 
drawn  uniformly  from  [20,180].  All  inputs  are  scaled  to  have  zero 
mean  and  unit  variance  as  estimated  over  the  entire  data  set.  No 
significant  correlation  was  found  between  performance  and  any  of 
these  choices. 

^  A  “super  test  set”  (the  period  from  September  17th  through  Octo¬ 
ber  19th,  1987  that  contains  the  1987  crash)  is  set  aside  for  some  final 
out-of-sample  forecasting  experiments,  described  in  Section  III-D. 
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Fig,  1.  (a)  The  level  of  the  Dow  Jones  Industrial  Average  from 

December  1962  -  October  1987.  (b)  The  raw  trading  volume  (ag¬ 
gregate  turnover)  on  the  NYSE,  The  nonstationarity  is  evident; 
this  is  a  semi-logarithmic  plot,  (c)  The  series  vt  that  we  use  as 
target:  it  is  obtained  by  taking  the  logarithm  of  the  raw  value 
and  dividing  it  by  the  mean  of  the  last  100  trading  days, 

divided  by  shares  outstanding,  or  the  fraction  of  shares 
traded  that  day.  This  series  is  not  stationary,  Fig.  1  (b). 
We  “detrend”  it  by  dividing  by  a  100-day  moving  average  of 
past  turnover.  In  other  words,  we  compare  the  volume  to¬ 
day  with  the  average  volume  over  the  last  100  trading  days. 
The  distribution  of  this  series  is  still  very  skewed.  We  then 
take  the  logarithm  to  obtain  a  less  skewed  distribution.^ 
We  refer  to  this  transformed  series  as  Vt-  This  target  series 
is  shown  in  Fig.  1  (c). 

Beside  three  lagged  values  of  v  (a  typical  autoregressive 
or  AR  model),  we  use  three  other  sets  of  variables,  making 
it  an  exogenous  or  ARX  model.  We  use  first  differences 
of  the  logarithm  of  the  level  of  the  Dow  Jones  Industrials 
Index  as  a  measure  of  relative  stock  returns,  r^.  Further¬ 
more,  volume  movements  axe  connected  to  stock  return 
movements  in  interesting  ways  (Karpov,  1987;  LeBaron, 
1992a;  Gallant  et  al,  1993).  One  of  these  features  is  that 
volume  is  related  to  stock  price  volatility,  sometimes  ap¬ 
proximated  by  the  absolute  magnitude  of  daily  price  move¬ 
ments.  Furthermore,  volume  tends  to  be  higher  in  rising 
markets.  For  these  reasons  we  chose  several  lagged  returns 
and  volume  variables  as  predictors.  The  predictor  vector 
(i.e.,  the  12  values  presented  to  the  network  as  inputs  for 
each  pattern)  is  given  by 

{  Vt-1,2,3  ,  7't_i,2,3  j  |l’t-l,2,3l  j  l0g(^t-l,2,3)  }  * 

^Normalizing  with  the  250-day  mean  (of  the  last  trading  year)  did 
not  remove  quite  enough  of  the  nonstationarity.  Note  also  that  we 
are  using  the  normalized  level  of  the  volume,  not  a  difference  ver¬ 
sion  that  would  correspond  to  the  change  in  volume.  Apart  from 
correcting  somewhat  for  the  skewed  distribution,  the  logarithm  can 
be  interpreted  as  emphasizing  small  values  of  the  volume  more  than 
large  ones,  and,  alternatively,  as  facilitating  product  interactions  be¬ 
tween  lagged  values  of  the  volume,  since  the  inputs  are  added  in  the 
argument  of  the  hidden  units,  and  adding  logarithms  corresponds  to 
multiplying  the  original  values. 


Here,  at  is  an  estimate  of  a  volatility.  It  is  defined  recur¬ 
sively  as 

cr'^  =  p  (T(_i  +  {1-  P)  rf  with  P  =  0.9  . 

This  represents  an  exponential  filter  of  the  squared  re¬ 
turns.  This  can  be  interpreted  in  physical  terms  as  a  re- 
laxator:  A  shock  in  rj  decays  in  -1/log^  =  9.5  days  to 
1/e  =  0.37  times  its  initial  value.®  We  initialize  htq  to  the 
unconditional  variance  of  the  series.  The  choice  of  the  ex¬ 
ponentially  smoothed  squared  returns  is  motivated  by  the 
similarity  to  variance  estimates  from  autoregressive  condi¬ 
tional  heteroskedastic  (ARCH)  models  often  used  in  finan¬ 
cial  time  series  (Bollerslev  et  al.,  1990;  Bollerslev  et  al., 
1995). 

Summarizing,  we  use  the  following  inputs  for  our  model: 

.  Three  lags  of  the  past  trading  volume,  Vt-i,2,z-  They 
are  normalized  by  the  100-day  moving  average  (but 
not  diflierenced),  see  Fig.  1  (c).  Their  one-day  autocor¬ 
relation  after  normalization  is  0.66.  (Without  our  nor¬ 
malization,  i.e.,  taking  the  raw  volume  firom  Fig.  1  (b), 
the  overall  shift  in  level  over  the  two  decades  is  respon¬ 
sible  for  an  autocorrelation  of  0.95.) 

•  Three  lags  each  of  the  relative  returns,  n-ixz-  Their 
one-day  autocorrelation  is  small  (0.135),  and  disap¬ 
pears  for  two  or  more  lags,  as  discussed  in  LeBaron 
(1992). 

•  Two  estimates  of  their  volatilities,  with  three  lags  each: 
—  Absolute  value  of  the  relative  returns,|r{_i,2,3|.  Their 

autocorrelation  coefficients  are  dropping  off  very 
slowly,  and  have  values  for  the  first  10  lags  around 
0.16,  computed  after  subtracting  the  mean  of  \rt\.^ 

-  Logarithm  of  the  exponentially  smoothed  squared  re¬ 
turns,  log((Tf_,_2,3)-  Their  one-day  autocorrelation 
is  0.975.  It  drops  off  very  slowly,  primarily  due  to 
the  smoothing  (each  value  re-enters  at  the  next  time 
step  attenuated  hy  p  =  0.9),  and  secondarily  due  to 
the  already  existing  autocorrelation  of  the  driving 
process  of  rf . 

We  refer  to  each  of  these  12-dimensional  predictor  vec¬ 
tors  with  the  associated  1-dimensional  target  value  as  a  pat¬ 
tern.  The  correlation  coefiicients  were  computed  through 
Oct  19,  1987,  i.e.,  excluding  the  effect  the  day  of  the  1987 
crash  would  have.  As  shown,  some  of  the  input  dimensions 
are  highly  correlated.  Despite  this  high  correlation  that 
gives  an  effective  overlap  of  the  patterns  in  the  three  sets, 
we  will  see  in  the  next  section  that  the  performance  varies 

®  An  equivalent  box-cart  moving  average  would  average  the  squared 
returns  over  19  days.  We  chose  the  exponential  average  since  it  does 
not  exhibit  the  box  cart’s  shadows,  i.e.,  the  effect  that  large  shocks 
show  up  again  with  the  opposite  sign  once  they  drop  off  the  left  side 
of  the  window. 

"^The  1/e  decay  time  of  the  corresponding  AR  process  is  about 
half  a  time  step.  This  does  not  characterize  the  time  scale  of  the 
underlying  process  well:  the  coefficient  is  severely  underestimated  due 
to  the  presence  of  noise  in  the  inputs  (“errors-in- variables”,  see  Fuller 
(1987)  and  Carroll  et  al.  (1995)).  Fitting  a  state  space  model  with 
full  dynamics  to  the  series  {|rt|-  <  \rt\  >}  gives  an  autoregressive 
coefficient  for  the  dynamics  of  the  state  of  0.9915,  corresponding  to 
an  1/e  decay  time  of  approximately  5  months  (117  trading  days). 
For  more  details  of  this  method  and  their  interpretation  for  modeling 
volatilities  see  Timmer  &  Weigend  (1997). 
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a  lot  for  different  random  samples  out  of  these  overlapping 
patterns. 

III.  Empirical  Results 
A.  Learning  Curves  and  Overfitting 

Fig.  2  shows  the  set  of  learning  curves®  for  a  typical 
run,  for  the  three  sets,  both  expressed  as  one  minus  the 
correlation  coefficient  squared,  (1  -  i2^),  and  as  the  mean 
squared  error  divided  by  the  overall  variance  of  the  target, 
NMSE.  Differences  between  these  two  reasonable  perfor¬ 
mance  measures  occur  when  the  mean  and  the  variances  are 
not  estimated  correctly.  Whereas  the  correlation  coefficient 
corrects  for  these  differences  (by  subtracting  the  means 
and  dividing  by  the  standard  deviations),  the  squared  error 
does  not,  and  is  thus  higher  than  (1  —  i?^). 


Learning  curves,  lines:  training  set,  x:  cross-validation  set,  o:  test  set. 


Fig.  2.  Learning  curves  of  one  specific  network  on  one  specific  split. 
They  show  the  performance  vs.  the  number  of  backpropagation 
iterations.  There  are  three  pairs  of  curves.  The  first  pair  (mono- 
tonically  decreasing)  gives  the  performance  on  the  training  set, 
the  second  pair  (denoted  by  “x”)  on  the  validation  set,  ajid  the 
third  pair  (“o”)  on  the  test  set.  The  three  solid  lines  plot  the 
(1  —  R^)  measure;  the  three  dash-dotted  lines  give  the  normal¬ 
ized  mean  squared  error  (NMSE).  The  straight  line  indicates  the 
test  error  of  a  linear  model  estimated  on  the  union  of  the  training 
set  and  the  validation  set. 

The  learning  curves  in  Fig.  2  show  performance  vs.  the 
number  of  backpropagation  iterations.  There  is  a  clear  in¬ 
crease  of  validation  and  test  errors  after  passing  through 
minima,  usually  called  overtraining  or  overfitting.  At  some 
stage  (around  epoch  800  in  this  specific  run  which  hap¬ 
pened  to  have  a  very  small  learning  rate)  the  network  ex¬ 
tracts  a  feature  of  the  training  set  that  helps  the  test  set, 
but  hurts  the  validation  set.  The  minima  of  the  validation 
set  and  the  test  set  do  not  occur  at  the  same  epoch.  From 
each  of  these  sets  of  learning  curves,  only  a  single  number 

®We  use  the  term  learning  curve  to  characterize  the  performance  as 
a  function  of  the  iterations  of  the  algorithm.  In  a  different  context, 
typically  when  an  arbitrary  number  of  training  patterns  can  be  gen¬ 
erated,  the  term  learning  curve  denotes  performance  as  a  function  of 
data  set  size. 


is  used  for  the  subsequent  analysis  and  comparisons  in  this 
article:  the  performance  value  on  the  test  set  at  that  epoch 
that  has  the  minimum  of  the  validation  set. 

B,  Linear  vs.  Nonlinear  Comparison 

One  of  the  most  important  goals  of  any  exploration  of  a 
nonlinear  forecasting  method  is  to  demonstrate  an  improve¬ 
ment  over  linear  forecasts.  For  synthetic  data,  generated 
from  nonlinear  noise- free  systems,  forecast  improvements  of 
several  orders  of  magnitude  have  been  reported:  consider 
the  celebrated  logistic  map  which  consists  only  of  a  second- 
order  component  (quadratic  term)  without  any  first-order 
(linear)  component.  It  really  should  come  as  no  surprise 
that  methods  that  allow  for  nonlinearities  will  vastly  out¬ 
perform  the  perfectly  inadequate  linear  fit  in  cases  when 
there  is  no  linear  component. 

We  here  focus  on  high-noise  real  world  data  where  the 
evaluation  is  much  harder,  and  potential  nonlinearities  are 
often  masked  by  noise.  In  this  case,  great  care  needs  to  be 
taken  to  evaluate  the  nonlinearity  of  the  model:  obtained 
on  a  single  split,  depending  on  the  split,  a  network  can  eas¬ 
ily  be  a  few  percent  better,  but  also  a  few  percent  worse 
than  the  linear  model.  Thus,  instead  of  just  comparing 
forecasts  on  one  split  and  one  out-of-sample  time  period, 
we  recommend  bootstrapping  and  reporting  the  distribu¬ 
tion  of  forecast  performance  for  both  the  network  and  lin¬ 
ear  forecasts.  This  allows  a  more  meaningful  statistical 
comparison  between  linear  and  neural  network  models. 

Solid:  neural  nets;  dashes:  linear  (2523  resamplings  each).  Dots:  697  nets  (1  sampling). 


Fig.  3.  Histograms  of  (1  —  R?)  forecast  performance.  The  solid 
line  shows  the  distribution  of  the  networks,  the  dashed  lines  of 
linear  model,  both  estimated  on  2523  different  resamplings  of  the 
available  data.  The  dotted  line  takes  just  one  split  of  the  data 
and  describes  the  distribution  of  697  networks.  The  fact  that  the 
width  of  the  dotted  histogram  is  clearly  smaller  than  the  width 
of  the  other  two  indicates  that  the  randomness  in  the  splitting 
of  the  data  generates  more  variability  than  the  randomness  in 
network  initialization  does. 

In  this  comparison  we  fit  for  each  split  a  linear  model  to 
exactly  the  same  patterns  (inputs  and  targets)  used  for  the 
network.  Parameters  are  estimated  using  the  union  of  the 
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same  tredning  and  validation  set,  and  (1  -  iZ^)  is  estimated 
over  the  same  test  set  from  each  bootstrap  resampling.® 

The  empirical  density  from  2523  bootstrap  resamplings 
of  the  forecast  performance  is  shown  in  Fig.  3.  The  solid 
line  displays  the  performance  of  the  networks,  and  the 
dashes  that  of  the  linear  models.  It  is  clear  from  this  pic¬ 
ture  that  distinguishing  between  the  two  forecasts  is  going 
to  be  difficult,  if  possible  at  all. 

Neural  net  performance  vs  linear  performance  (ratio).  [N=s2523,  mean=0.996,  std^O.016] 


Fig.  4.  Histogram  of  the  ratio  of  (1  -  B?)  network  performance  di¬ 
vided  by  (1  —  B?)AineSLV  performance  for  2523  resamplings.  Each 
entry  in  this  histogram  corresponds  to  the  performance  ratio  of 
one  network  and  one  linear  model  trained  on  one  specific  split. 

To  focus  on  the  comparison,  Fig.  4  shows  a  histogram 
of  the  run-by-run  ratio  of  the  two  forecast  performance 
measures.  This  ratio  is  estimated  for  each  of  the  bootstrap 
samples  and  recorded.  If  the  networks  were  consistently 
outperforming  the  linear  models  then  this  ratio  would  be 
less  than  unity.  However,  this  histogram  shows  that  it  is 
not  very  likely  that  the  network  will  do  better  than  a  linear 
model  in  most  cases. 

Another  perspective  on  the  correlation  of  forecast  perfor¬ 
mance  between  neural  networks  and  linear  models  is  given 
in  Fig.  5,  a  scatter  plot  of  the  performance  of  the  nonlinear 
vs.  the  nonlinear  model, 

{  (1  “  ^i)^  (1  ^nnY  }  • 

^One  referee  suggested  a  comparison  with  an  autoregressive  mov¬ 
ing  average  (ARMA)  models  instead  of  the  exogenous  autoregressive 
linear  (ARX)  we  use.  ARMA  models  are  indeed  more  general  linear 
models  than  AR  models.  However,  for  the  pairs-bootstrap  study  pre¬ 
sented  here,  where  resampling  destroys  the  sequence  of  the  patterns, 
it  is  not  possible  to  feed  back  errors  (the  MA  part).  An  ARMA  model 
cannot  be  used  in  combination  with  the  pairs-bootstrap;  ARX  is  thus 
the  appropriate  linear  model  class  to  compare  to. 

^^The  average  ratio  in  Fig.  4  is  0.996 ±0.016.  On  the  one  hand,  this 
is  significantly  diflferent  from  1,  with  a  t-statistic  of  12,6,  indicating 
a  significant,  but  small  improvement  in  overall  forecast  performance. 
On  the  other  hand,  when  we  compare  the  forecast  performance  using 
squared  forecast  errors,  we  find  that  the  average  ratio  (over  the  same 
2523  runs)  is  larger  than  unity,  1.003  ±0.020  (the  confidence  intervals 
are  statistical  errors  of  one  standard  deviation).  This  leads  us  to  the 
conclusion  that  there  is  no  relevant  difference  between  the  nets  and 
the  linear  fits. 


One  point  is  entered  for  each  bootstrap  sample,  i.  If  the 
networks  are  picking  up  much  of  the  same  structure  as  the 
linear  forecasts,  we  will  see  a  strong  correlation  between  the 
two.  This  is  indeed  the  case  in  Fig.  5  where  the  correlation 
between  forecast  errors  is  0.936. 

To  summarize  this  section:  When  we  embarked  on  this 
experiment,  we  were  hoping  for  simple  clean  evidence  for 
nonlinear  structure  in  the  volume  of  the  NYSE,  of  high  in¬ 
terest  for  economists.  What  we  found  instead  is  that  pos¬ 
sible  underlying  nonlinearities  are  not  easily  discovered — 
using  a  model  class  celebrated  for  its  ability  to  express  any 
nonlinear  function  (feedforward  networks  with  tanh  hidden 
units  with  a  squared  error  cost  function)  did  not  reveal  such 
structure.  Since  this  article  focuses  on  the  variation  due  to 
different  splitting  of  the  data,  we  did  not  use  explore  alter¬ 
natives  to  early  stopping  that  avoid  the  bias  towards  a  lin¬ 
ear  solution,  such  as  weight-elimination  or  pruning;  those 
are  interesting  experiments  and  the  data  is  available  from 
the  authors’  web  sites.  Furthermore,  we  did  not  use  com¬ 
puter  generated  nonlinear  data,  since  generating  an  arbi¬ 
trarily  large  number  of  noise-free  data  points  of  an  ergodic 
system  will  typically  (for  any  split  of  the  data)  give  very 
close  neighbors  between  the  different  sets.  This  does  not 
constitute  a  serious  test  for  the  real-world  problem  of  noisy 
data  of  finite  record  length,  perhaps  slightly  nonlinear,  that 
we  typically  find  in  economics,  finance  and  business. 

C.  Variability  Over  Random  Networks 

Our  procedure  randomizes  both  over  data  samples  and 
over  network  architectures  and  initial  parameters.  An  im¬ 
portant  question  is:  How  much  of  the  variability  is  due 
to  the  data  set  resampling,  and  how  much  is  due  to  the 
network  parameters?  Viewed  from  a  different  angle:  for 
a  given  split,  how  much  model  overfitting  would  connec- 
tionists  be  likely  to  engage  in  were  they  to  optimize  their 
network  architecture  etc.  for  that  split?  If  great  gains 
were  possible  by  tinkering  with  network  parameters  for 
each  split,  we  should  observe  a  lot  of  variability  in  forecast 
performance  over  randomly  initialized  networks  on  a  given 
data  set  split.  However,  the  dotted  line  in  Fig.  3  shows 
a  representative  density  for  697  randomly  drawn  nets,  all 
trained  and  tested  using  the  same  training,  validation,  and 
test  sets.  The  answer  to  the  question  is:  The  variations  of 
the  forecasts  due  to  changes  in  network  structure  are  small 
relative  to  the  variations  due  to  sample  splitting. 

D.  Probability  Density  of  the  Forecasts 

Now  that  we  have  an  entire  ensemble  of  neural  network 
predictors,  we  can  investigate  how  all  these  networks  can 
give  us  a  fresh  view  on  the  old  idea  of  combining  forecasts 
by  looking  at  the  scale  of  the  variations  compared  to  the 
noise  inherent  in  the  problem.  We  use  each  of  the  networks 
to  make  predictions  on  a  sample  that  had  been  set  aside 
throughout  (i.e.,  never  used  during  training,  validating  or 
testing).  The  time  period  of  this  sample  starts  immedi¬ 
ately  after  the  time  period  considered  so  far,  i.e.,  it  starts 
on  September  17th,  1987,  and  includes  the  crash  of  Octo¬ 
ber  19th,  1987,  a  day  with  unusually  large  price  movement 
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Scatter  plot:  one  point  per  run  (resampling) 


Fig.  5.  Scatter  plot  of  the  prediction  errors.  For  each  of  the  2523 
runs  (i.e.,  different  splits  of  the  data)  one  point  is  entered  into 
this  scatter  plot.  Its  location  is  determined  by  the  performance 
of  the  network  vs.  the  linear  model.  Note  that  the  point  cloud  is 
much  more  stretched  out  along  the  45>degree  line  than  orthogonal 
to  it,  again  indicating  that  there  is  more  variation  due  to  the 
randomness  in  data  splits  than  the  variation  that  results  from 
the  randomness  in  the  initial  conditions. 


and  trading  volume. 

Fig.  6  displays,  for  each  day,  the  density  of  all  the  net¬ 
work  predictions.  They  are  single-step  forecasts:  the  input 
for  tomorrow’s  prediction  contains  today’s  observed  values. 
The  solid  lines  are  the  histograms  of  the  individual  raw 
predictions;  they  have  not  been  convolved  with  any  added 
uncertainties.  The  actual  data  points  are  marked  with  x. 
(The  data  points  for  the  stock  market  crash,  October  19th 
and  2dth,  1989  are  missing  since  they  are  oS  the  scale.)  A 
few  interesting  features  are  contained  in  the  figure.  First, 
we  see  that  the  forecasts  in  many  cases  are  biased  high  or 
low,  indicating  generally  mediocre  forecast  performance. 
(Explaining  50  percent  of  the  variance  of  the  data  means 
that  there  still  remains  50  percent  unexplained!)  Second, 
the  fact  that  for  many  of  the  days  the  width  of  the  distribu¬ 
tion  is  quite  small  and  quite  far  away  from  the  actual  value 
suggests  that  even  the  smartest  selection  or  combination 
of  forecasts  cannot  yield  much  improvement.^^  Finally,  we 
can  see  how  the  models’  predictions  begin  to  spread  apart 
as  the  period  of  the  crash  is  reached.  The  main  reason 
for  this  spreading  is  that  the  inputs  wander  off  into  re- 

^^The  idea  of  combining  of  forecasts  (Bates  &  Granger,  1969),  based 
on  the  idea  that  superposition  helps  to  the  degree  that  the  errors  are 
uncorrelated,  has  recently  reached  the  connect ionist  community,  see 
e.g.,  (Jacobs,  1995).  This  article  presents,  on  a  practical  example, 
the  limitations  of  averaging  for  noisy  data:  the  empirical  densities 
show  that  averaging  over  all  the  splits  we  did  (by  taking  the  mean, 
median  or  any  convex,  possibly  even  adaptive,  combination  of  the 
1843  individual  models)  will  not  improve  the  predictions  dramatically. 


gions  where  the  network  has  never  seen  training  points. 
Regression  neural  networks  do  not  spend  any  resources  on 
modeling  the  density  of  the  inputs — moving  away  from  re¬ 
gion  of  interpolation  to  extrapolation  manifests  itself  indi¬ 
rectly  through  deteriorating  performance.  Thanks  to  the 
benevolent  nature  of  tanh  hidden  units,  the  output  remains 
bounded  even  for  thus  far  unexplored  regions. 

IV.  Relation  to  other  sources  of  uncertainty 

Forecast  uncertainty  can  come  from  many  sources.  We 
focused  on  the  uncertainty  obtained  from  the  specific  splits, 
that  can  be  called  splitting  uncertainty.  In  the  larger  pic¬ 
ture,  its  size  is  relatively  small  compared  to  all  the  noise 
sources  that  contribute  to  the  normalized  squared  error  of 
about  50  per  cent,  or  a  correlation  coefficient  of  about 

R=y/{1-  0.5)  =  0.7  . 

We  here  briefly  describe  the  effect  of  other  sources  of 
uncertainty:^^ 

•  Noisy  targets.  An  appropriately  trained  network  out¬ 
puts  expected  values.  Gradient  descent  in  a  squared 
error  cost  function  can  be  interpreted  in  a  maximum 
likelihood  framework  as  the  observed  values  being  nor¬ 
mal  distributed  around  the  predicted  values  with  con¬ 
stant  noise  level.  This  assumption  can  be  relaxed,  first 
by  allowing  a  Gaussian  with  locally  varying  widths, 
then  by  modeling  the  output  distribution  with  poten¬ 
tially  multimodal  functions: 

-  Heteroskedasticity  (“local  error  bars”).  Nix  & 

Weigend  (1995)  described  a  method  to  train  a  net¬ 
work  with  two  output  units,  the  first  giving  the  pre¬ 
diction,  the  second  the  error  bar.  Those  two  num¬ 
bers,  both  functions  of  the  input  space,  parametrize 
a  Gaussian  and  can  be  used  for  unimodal  densi¬ 
ties.  This  method  is  more  flexible  than  the  constant 
variance  assumption,  but  not  appropriate  for  multi¬ 
modal  output  densities. 

-  The  assumption  of  a  single  Gaussian  can  be  gener¬ 

alized  to  a  mixture  of  Gaussians  that  allow  predic¬ 
tion  of  more  general  densities.  Jacobs  et  al.  (1991) 
introduced  Gaussian  mixture  models  to  the  connec- 
tionist  community,  and  Weigend  et  al.  (1995)  ap¬ 
plied  them  to  time  series  prediction.  As  an  alter¬ 
native,  rather  than  using  this  mixture  of  Gaussians 
with  varying  centers,  Weigend  &  Srivastava  (1995) 
introduced  a  fuzzy-logic  like  superposition  of  tent- 
functions  at  fixed  centers  to  model  potentially  mul¬ 
timodal  densities. 

•  Noisy  inputs  (observational  noise).  This  important 
noise  source  in  autoregression  of  noisy  time  series  is 
well  known  in  statistics  and  econometrics  (see  Sec¬ 
tion  II-B)  but  less  well  known  in  the  connectionist 
community  (Weigend  et  al.j  1996).  If  the  levels  of  the 
noise  for  each  input  is  known,  the  effect  can  always  be 

Other  sources,  important  in  nonlinear  dynamical  systems  with 
low  noise,  such  as  the  divergence  of  nearby  trajectories,  are  less  im¬ 
portant  in  the  present  case  of  noisy  financial  data. 
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solid  line:  distribution  over  1843  networks.  x:  date  points 


Fig.  6,  One-day  ahead  densities  for  the  days  on  the  “super-test-set”.  The  October  1987  stock  market  crash  occurred  on  day  23  on  this  scale; 
the  value  for  the  volume  went  off  scale. 


emulated  with  a  Monte  Carlo  simulation  of  forward 
passes  with  slightly  different  values  for  the  inputs  in 
order  to  build  a  density  reflecting  input  fluctuations. 

•  Parameter  noise  (uncertainty  in  the  weights).  From 
a  Bayesian  perspective,  model  parameters  are  never 
known  exactly  but  also  have  some  uncertainty  that 
translates  into  an  uncertainty  of  the  prediction  (Bun- 
tine  &c  Weigend,  1991;  MacKay,  1992;  Neal,  1996). 
While  an  analytic  approximation  is  only  available  for 
simple  cases,  it  is  always  possible  to  obtain  a  distribu¬ 
tion  by  generating  forward  passes  through  networks 
with  slightly  diflFerent  weight  values. 

♦  Regions  of  low  input  density  (extrapolation).  At  the 
end  of  the  Section  III-D  we  discussed  the  uncertainty 
for  the  1987  crash  stemming  from  too  few  patterns 
in  the  vicinity  of  the  pattern  for  which  the  prediction 
is  to  be  made.  For  the  data  set  used  in  this  paper 
this  is  not  an  important  source  here  since  adjacent 
input  patterns  axe  highly  correlated,  implying  that  in 
most  cases  the  network  will  have  encountered  nearby 
neighbors  in  the  training  set.  This  yields,  however,  to 
an  overly  optimistic  interpretation  of  the  performance. 

The  specific  values  of  the  prediction  performance  should 
not  be  overinterpreted.  As  typically  done  in  cross-sectional 
bootstraps,  we  pick  the  validation  and  test  patterns  inter¬ 
spersed  with  the  training  data  in  order  to  obtain  an  indica¬ 
tion  of  the  variation  of  the  subsample  selection.  Care  has 
to  be  taken,  however,  in  interpreting  the  results  as  accu¬ 
rate  estimates  of  the  generalization  performance  for  truly 
future  data.  If  there  is  a  strong  overlap  from  one  pattern 
to  the  next  (imagine  a  problem  where  all  inputs  are  highly 
smoothed,  like  the  exponentially  filtered  volatility  estimate 
we  use,  or,  even  without  smoothing,  if  the  data  is  sampled 
with  a  frequency  a  lot  faster  than  the  dynamics  of  the  sys¬ 
tem!)  the  chances  are  high  that  for  a  given  test  pattern, 
there  will  be  very  similar  training  patterns  adjacent  in  time. 
In  this  case,  more  accurate  estimates  of  the  performance  on 
future  data  might  be  obtained  by  bootstrapping  blocks  of 


data  (Kunsch,  1989;  Liu  &  Singh,  1992).  Note,  however, 
that  these  blocks  are  still  taken  from  the  entire  period. 
So,  if  the  dynamics  is  truly  nonstationarity,  this  blocks- 
bootstrapping  will  still  give  overly  optimistic  results.  To 
avoid  fooling  oneself  on  financial  data,  we  strongly  recom¬ 
mend  using  only  data  for  testing  that  arrived  after  the  end 
of  the  training  and  validation  period  (whether  these  two 
are  interspersed,  blocked,  or  sequential). 

V.  Conclusions 

This  article  demonstrated  the  usefulness  of  a  pairs  boot¬ 
strap  approach  to  generating  and  testing  time  series  fore¬ 
casts.  We  then  applied  the  procedure  to  trading  volume. 
Contrary  to  our  expectation,  no  improvement  over  linear 
models  could  be  obtained  with  a  standard  network  trained 
with  backpropagation  and  regularized  by  early  stopping. 
This  does  not  rule  out  the  possibility  of  forecast  improve¬ 
ments  using  additional  forecast  variables,  or  by  using  prun¬ 
ing  and  weight-elimination  techniques. 

The  simulations  gave  us  important  insights  into  the  vari¬ 
ability  of  forecast  performance  over  changes  in  subsamples 
and  network  structure.  For  our  example,  most  of  the  vari¬ 
ability  in  forecast  performance  was  clearly  coming  from 
sample  variation  and  not  from  model  variation.  This  tells 
us  that  for  this  series  there  is  probably  little  hope  in  fine 
tuning  the  networks  we  used.  This  is  an  example  of  an 
application  where  we  feel  that  procedures  such  as  boot¬ 
strapping  are  extremely  useful  in  getting  a  clearer  picture 
of  what  might  be  real  and  what  is  noise. 
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